Projection of a Point onto a Convex Set
via Charged Balls Method


Majid E. Abbasov 


Abstract Finding the projection of a point onto a convex set is a problem of
computational geometry that plays an essential role in mathematical programming
and nondifferentiable optimization. The recently proposed Charged Balls Method
solves this problem when the boundary of the set is given by a smooth function.
In this chapter, we give an overview of Charged Balls Method and show that the
method is also applicable in the nonsmooth case. More specifically, we consider
a set defined by the maximum of a finite number of smooth functions. The
obtained results show that even for a set with a nonsmooth boundary, Charged
Balls Method still solves the problem effectively in comparison with other
algorithms. This is confirmed by the results of numerical experiments.


1 Introduction 


One of the most transparent and clear ways of constructing optimization algorithms
is based on physical analogies. The initial optimization problem is replaced by a
mechanical system. This system tends to an equilibrium point that coincides with
the solution of the original problem. Then differential equations of motion are
derived. By implementing a difference scheme (numerical procedure) for solving these
equations, an iterative method for the original problem is obtained. One can mention
the Snyman-Fatti algorithm [1] or the Heavy Ball Method [2-8] as representatives of
this class of algorithms (see also [9]).

Charged Balls Method [10] belongs to the same class. It is a recently
proposed algorithm for solving various computational geometry problems such as
finding the minimum distance from a point to a convex closed set with a smooth
boundary, finding the minimum distance between two such sets, and others. The
main aim of this chapter is to give an overview of Charged Balls Method and adapt
it to the nonsmooth case. To do so, we show how the problem of projection of a point
onto a convex closed set with a nonsmooth boundary can be solved via Charged
Balls Method.

This chapter is organized as follows. In Sect. 2, we give a brief overview of
Charged Balls Method and state results on its convergence. There we also
describe the most successful modification of Charged Balls Method, which we apply
in what follows. Section 3 discusses the replacement of the nonsmooth
problem with its smooth analogue. Numerical experiments are provided in Sect. 4.


2 Charged Balls Method and its Modification 


Consider the problem of finding the orthogonal projection of the origin onto a set
$X = \{x \in \mathbb{R}^n \mid f(x) \le 0\}$, where $f\colon \mathbb{R}^n \to \mathbb{R}$ is a convex function of class $C^2$.
We also assume that $0 \notin X$ and $\nabla f(x) \ne 0$ for all $x \in \operatorname{bd} X$, where $\operatorname{bd} X$ is the
boundary of $X$. The problem is stated as follows:

$$
\|x\| \to \min, \qquad x \in X. \tag{1}
$$

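It was demonstrated in [10] that the solution coincides with the equilibrium point
of the system

$$
\begin{cases}
\dot{x} = z, \\
\dot{z} = p_1 w(x) - p_2 z - \chi(x, z),
\end{cases} \tag{2}
$$

where

$$
w(x) = -\frac{x}{\|x\|^3} + \frac{\langle x, \nabla f(x)\rangle \nabla f(x)}{\|x\|^3 \|\nabla f(x)\|^2},
\qquad
\chi(x, z) = \frac{\langle f''(x) z, z\rangle \nabla f(x)}{\|\nabla f(x)\|^2},
$$

and $p_1$, $p_2$ are some parameters. This system can be considered as a continuous
variant of Charged Balls Method. The convergence of the method is equivalent
to the asymptotic stability of this equilibrium in Lyapunov’s sense.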
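As a small illustration of the role of $w(x)$, the sketch below evaluates it on a hypothetical test set, the unit disk centred at $(3, 0)$, which is not taken from the chapter. It assumes that $w(x)$ is the attraction $-x/\|x\|^3$ towards the origin with its component along $\nabla f(x)$ removed. Under that assumption $w$ should be tangent to the boundary and should vanish at the boundary point closest to the origin.

```python
import math

C = (3.0, 0.0)  # hypothetical test set: unit disk centred at C (origin lies outside)

def grad_f(x):
    # Gradient of the illustrative f(x) = ||x - C||^2 - 1.
    return [2.0 * (x[0] - C[0]), 2.0 * (x[1] - C[1])]

def w(x):
    # Assumed form of w: attraction -x/||x||^3 toward the origin with its
    # component along grad f(x) removed, so that w is tangent to bd X.
    g = grad_f(x)
    r3 = math.hypot(x[0], x[1]) ** 3
    coef = (x[0] * g[0] + x[1] * g[1]) / (r3 * (g[0] ** 2 + g[1] ** 2))
    return [-x[i] / r3 + coef * g[i] for i in range(2)]

def norm(v):
    return math.hypot(v[0], v[1])

x_proj = [2.0, 0.0]   # boundary point closest to the origin
x_other = [3.0, 1.0]  # some other boundary point

w_other = w(x_other)
g_other = grad_f(x_other)
tangency = w_other[0] * g_other[0] + w_other[1] * g_other[1]

print("||w|| at the projection:", norm(w(x_proj)))
print("||w|| at another boundary point:", norm(w_other))
print("<w, grad f> at that point:", tangency)
```

The norm of $w$ therefore behaves like a residual: it is zero at the projection and positive elsewhere on the boundary of this test set.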

Therefore, via the Barbashin-Krasovskii-LaSalle principle, the following result can
be proved.

Theorem 1 (See [10]) Let problem (1) have a finite number of stationary points,
and let $\bar{x}$ be the solution of (1). Then there exists $\delta > 0$ such that, for any
$[x^0, z^0]$ from the intersection of the ball of radius $\delta$ centered at $[\bar{x}, 0]$,

$$
B_\delta([\bar{x}, 0]) = \{[x, z] \mid \|[x, z] - [\bar{x}, 0]\| < \delta\},
$$

with the set $W = \{w \in \mathbb{R}^{2n} \mid w = [x, z],\ x \in \operatorname{bd} X,\ \langle \nabla f(x), z\rangle = 0\}$,
method (2) with the initial conditions $[x^0, z^0]$ converges to $[\bar{x}, 0]$.
Here $[x, z]$ is a vector from $\mathbb{R}^{2n}$ obtained by concatenation of the $n$-dimensional
vectors $x$ and $z$.


It can be shown that the point $x$ is an equilibrium of system (2) iff $w(x) = 0$.
Consequently, the inequality $\|w(x)\| < \varepsilon$, where $\varepsilon > 0$ is a small constant, can be
used as the stopping criterion. Hence the next result provides an estimate
for the rate of convergence of Charged Balls Method.


Theorem 2 (See [10]) Assume that all the conditions of Theorem 1 are satisfied.
Let $p_1$ and $p_2$ be the parameters of the method, and let the Jacobian matrix $w'(x)$
satisfy the inequality

$$
\langle w'(x) z, z\rangle \ge \frac{p_2^2}{p_1} \|z\|^2
$$

for all $[x, z] \in B_\delta([\bar{x}, 0]) \cap W$. Then for any initial conditions $[x^0, z^0] \in
B_\delta([\bar{x}, 0]) \cap W$, the following estimate holds for method (2):

$$
\min_{\tau \in [0, t]} \|w(x(\tau))\|^2 \le \frac{V(0) - v_*}{p_1 t},
$$

where

$$
V(x, z) = -\langle w(x), z\rangle - \frac{p_2}{2 p_1} \|z\|^2,
\qquad
v_* = \min_{[x, z] \in B_\delta([\bar{x}, 0]) \cap W} V(x, z).
$$


To get an iterative algorithm (discrete scheme) for solving the system (2), one
can apply any numerical procedure for solving ordinary differential equations. The
simplest scheme is obtained via the Euler method.

Choose a small positive $\delta$ for the step size, and set $x_0 = x(0)$, $z_0 = z(0) = 0$. Assume
we have $x_{k-1}$ and $z_{k-1}$. Then

$$
\begin{aligned}
x_k &= x_{k-1} + \delta z_{k-1}, \\
z_k &= z_{k-1} + \delta \left(p_1 w(x_{k-1}) - p_2 z_{k-1} - \chi(x_{k-1}, z_{k-1})\right).
\end{aligned} \tag{3}
$$


In [11], different modifications of Charged Balls Method were considered in
order to get the fastest variant of the method. The so-called speed zeroing idea
showed the most promising result. We describe this algorithm with a constant step
$\delta > 0$:

$$
\begin{aligned}
\tilde{x}_{k+1} &= x_k + \delta z_k, \\
x_{k+1} &= \tilde{x}_{k+1} - \frac{f(\tilde{x}_{k+1})}{\|\nabla f(\tilde{x}_{k+1})\|^2} \nabla f(\tilde{x}_{k+1}), \\
z_{k+1} &= w(x_{k+1}).
\end{aligned} \tag{4}
$$

It must be noted that this modification is of the first order. Therefore, the
local projection procedure at each iteration is mandatory here. This procedure is
represented by the second equation in (4).
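The iteration with speed zeroing can be sketched in a few lines. The example below is only an illustration under stated assumptions: the test set (the unit disk centred at $(3, 0)$), the starting point, the tolerance and the initialisation $z_0 = w(x_0)$ are hypothetical choices, and $w$ is assumed to be the tangential part of the attraction $-x/\|x\|^3$.

```python
import math

C = (3.0, 0.0)  # hypothetical test set: unit disk centred at C

def f(x):
    return (x[0] - C[0]) ** 2 + (x[1] - C[1]) ** 2 - 1.0

def grad_f(x):
    return [2.0 * (x[0] - C[0]), 2.0 * (x[1] - C[1])]

def w(x):
    # Assumed form of w: tangential part of the attraction -x/||x||^3.
    g = grad_f(x)
    r3 = math.hypot(x[0], x[1]) ** 3
    coef = (x[0] * g[0] + x[1] * g[1]) / (r3 * (g[0] ** 2 + g[1] ** 2))
    return [-x[i] / r3 + coef * g[i] for i in range(2)]

def speed_zeroing(x0, delta=0.1, eps=1e-3, max_iter=10000):
    """Constant-step iteration: trial step, local projection, speed reset."""
    x = list(x0)
    z = w(x)  # assumption: start with z_0 = w(x_0)
    for k in range(max_iter):
        if math.hypot(z[0], z[1]) < eps:  # stopping test ||w(x_k)|| < eps
            return x, k
        # Trial step along the current speed.
        xt = [x[i] + delta * z[i] for i in range(2)]
        # Local projection: one Newton-type step back toward f = 0.
        g = grad_f(xt)
        s = f(xt) / (g[0] ** 2 + g[1] ** 2)
        x = [xt[i] - s * g[i] for i in range(2)]
        # Speed zeroing: the new speed is reset to w at the new point.
        z = w(x)
    return x, max_iter

x_final, iterations = speed_zeroing([3.0, 1.0])
print(x_final, iterations)
```

For this test set the closest boundary point to the origin is $(2, 0)$, so the returned point can be compared with it directly. The single projection step keeps the iterate close to the boundary because the trial step is short.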

We can also use a backtracking idea to adjust the step size at each iteration. Choose a
small $\varepsilon > 0$, an initial step $\bar{\delta} > 0$, and some $\lambda \in (0, 1)$. Let $x_k$ and $z_k$ be given. Set
$\delta = \bar{\delta}$.


1. Check the inequality $\|w(x_k)\| > \|w(x_k + \delta z_k)\|$.
   a. If it is satisfied, build

      $$
      \begin{aligned}
      \tilde{x}_{k+1} &= x_k + \delta z_k, \\
      x_{k+1} &= \tilde{x}_{k+1} - \frac{f(\tilde{x}_{k+1})}{\|\nabla f(\tilde{x}_{k+1})\|^2} \nabla f(\tilde{x}_{k+1}), \\
      z_{k+1} &= w(x_{k+1}),
      \end{aligned}
      $$

      and go to step 2 of the algorithm.
   b. If it is not satisfied, shrink the step, $\delta = \lambda \delta$, and go to step 1 of the
      algorithm.


2. Check the stopping criterion

   $$
   \|w(x_{k+1})\| < \varepsilon. \tag{5}
   $$

   a. If it is true, take $x_{k+1}$ as the solution and finish computations.
   b. If it is not true, set $\delta = \bar{\delta}$, increase $k$ by one, and go to step 1 of the
      algorithm.


It should be noted that the reduction of $\delta$ stops at some moment since $z_k$ is a descent
direction for the function $\|w(x)\|$ at the point $x_k$.
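The backtracking variant can be sketched in the same way. Everything specific here is an assumption made for illustration: the test set (the unit disk centred at $(3, 0)$), the starting point chosen fairly close to the solution, the parameter values, and the guard `delta_min`, which is not part of the algorithm above and only prevents an endless shrinking loop in floating-point arithmetic. As before, $w$ is assumed to be the tangential part of $-x/\|x\|^3$.

```python
import math

C = (3.0, 0.0)  # hypothetical test set: unit disk centred at C

def f(x):
    return (x[0] - C[0]) ** 2 + (x[1] - C[1]) ** 2 - 1.0

def grad_f(x):
    return [2.0 * (x[0] - C[0]), 2.0 * (x[1] - C[1])]

def w(x):
    # Assumed form of w: tangential part of the attraction -x/||x||^3.
    g = grad_f(x)
    r3 = math.hypot(x[0], x[1]) ** 3
    coef = (x[0] * g[0] + x[1] * g[1]) / (r3 * (g[0] ** 2 + g[1] ** 2))
    return [-x[i] / r3 + coef * g[i] for i in range(2)]

def norm(v):
    return math.hypot(v[0], v[1])

def project(xt):
    # Local projection: one Newton-type step toward the boundary f = 0.
    g = grad_f(xt)
    s = f(xt) / (g[0] ** 2 + g[1] ** 2)
    return [xt[i] - s * g[i] for i in range(2)]

def cbm_backtracking(x0, delta_bar=0.1, lam=0.5, eps=1e-3,
                     max_iter=10000, delta_min=1e-12):
    x = list(x0)
    z = w(x)
    for k in range(max_iter):
        delta = delta_bar  # the step is reset at every outer iteration
        # Step 1: shrink delta until ||w|| decreases at the trial point.
        while delta > delta_min:
            trial = [x[i] + delta * z[i] for i in range(2)]
            if norm(w(x)) > norm(w(trial)):
                break
            delta *= lam
        x = project([x[i] + delta * z[i] for i in range(2)])
        z = w(x)
        # Step 2: stopping criterion (5).
        if norm(z) < eps:
            return x, k + 1
    return x, max_iter

x_start = [3.0 - math.cos(0.5), math.sin(0.5)]  # boundary point near the solution
x_final, iterations = cbm_backtracking(x_start)
print(x_final, iterations)
```

The starting point is taken near the solution on purpose: the decrease test in step 1 is a local property, so far from the solution the step may be shrunk many times before a trial point is accepted.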

These modifications showed (see [11]) better results, especially in high-dimensional
cases, than other methods such as:


• Exterior penalty function method (see [12-14])
• Geometric balls method specially developed for the problem (see [15])
• Ellipsoid method (see [16, 17])


In what follows, we will use modification (4) of the Charged Balls Method.


3 Case of the Set with Nonsmooth Boundary


Let us consider the set $X = \{x \in \mathbb{R}^n \mid f(x) \le 0\}$, where

$$
f(x) = \max_{i \in I} \varphi_i(x),
$$

$I = \{1, \dots, m\}$ is a finite index set, and $\varphi_i\colon \mathbb{R}^n \to \mathbb{R}$ are smooth convex functions
for all $i$ in $I$.

Our aim is to adapt Charged Balls Method for finding the projection of the
origin onto the convex set $X$. To do so, we build a smooth
approximation of the function $f(x)$ that allows us to solve the problem to any required
accuracy.


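Proposition 1 Let

$$
f(x) = \max_{i \in I} \varphi_i(x),
$$

where $I = \{1, \dots, m\}$ and $\varphi_i\colon \mathbb{R}^n \to \mathbb{R}$ are smooth convex functions for all $i$ in $I$.
Then the function

$$
f_\alpha(x) = \frac{1}{\alpha} \ln \sum_{i \in I} e^{\alpha \varphi_i(x)}, \tag{6}
$$

where $\alpha > 0$ is a positive finite constant, satisfies the inequality

$$
f(x) \le f_\alpha(x) \le f(x) + \frac{\ln m}{\alpha}. \tag{7}
$$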


Proof We have

$$
f_\alpha(x) = \frac{1}{\alpha} \ln \left( e^{\alpha f(x)} \sum_{i \in I} e^{\alpha (\varphi_i(x) - f(x))} \right)
= f(x) + \frac{1}{\alpha} \ln \sum_{i \in I} e^{\alpha (\varphi_i(x) - f(x))}.
$$

For at least one $i \in I$ the equality $\varphi_i(x) = f(x)$ holds, so the sum is at least 1,
which gives the left side of the double inequality (7). Furthermore, since

$$
\varphi_i(x) - f(x) \le 0 \quad \forall i \in I,
$$

each term of the sum is at most 1, and we can conclude that

$$
f_\alpha(x) \le f(x) + \frac{1}{\alpha} \ln m.
$$

Inequality (7) implies that if we apply Charged Balls Method to the set defined
by the function $f_\alpha(x)$, the motion takes place on the boundary of the smoothed set,
which lies inside the set $X$ (since $f(x) \le f_\alpha(x)$) and within any required
neighborhood of the boundary of $X$ (by choosing sufficiently large $\alpha$).
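Inequality (7) is easy to check numerically. The sketch below evaluates $f_\alpha$ in the shifted form used in the proof, which also avoids overflow of the exponentials for large $\alpha$. The two quadratic functions and the sample points are hypothetical choices made only for this check.

```python
import math

def smooth_max(values, alpha):
    # f_alpha = (1/alpha) * ln(sum_i exp(alpha * phi_i)), evaluated as
    # f + (1/alpha) * ln(sum_i exp(alpha * (phi_i - f))) with f = max_i phi_i.
    top = max(values)
    return top + math.log(sum(math.exp(alpha * (v - top)) for v in values)) / alpha

def phis(x):
    # Hypothetical pair of smooth convex functions (two disks of radius 5).
    return [
        (x[0] - 5.0) ** 2 + (x[1] - 4.0) ** 2 - 25.0,
        (x[0] - 5.0) ** 2 + (x[1] + 4.0) ** 2 - 25.0,
    ]

points = ([5.0, 1.0], [2.0, 0.0], [0.0, 0.0], [3.0, -0.5])
for alpha in (10.0, 20.0, 30.0):
    for x in points:
        vals = phis(x)
        lower = max(vals)                            # f(x)
        upper = lower + math.log(len(vals)) / alpha  # f(x) + ln(m)/alpha
        print(alpha, x, lower, smooth_max(vals, alpha), upper)
```

At points where only one function is active, $f_\alpha$ is practically equal to $f$. The upper bound is attained where all functions coincide, for example at the origin for this pair.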


4 Numerical Experiments 


In this section, we provide results of numerical experiments for the problem of
projection of the origin onto a convex set with a nonsmooth boundary. We solve this
problem by formulating its smooth version and then applying the speed zeroing
modification of Charged Balls Method (4). We do not carry out a comparison with
competing methods since this was already done in [11], and such a
comparison would only repeat the results obtained there. The aim of this section is to
show how the approach proposed in the current study works.


Example 1 Consider a function $f\colon \mathbb{R}^2 \to \mathbb{R}$, $f(x) = \max_{i = 1, 2} h_i(x)$, where

$$
\begin{aligned}
h_1(x) &= (x_1 - 5)^2 + (x_2 - 4)^2 - 25, \\
h_2(x) &= (x_1 - 5)^2 + (x_2 + 4)^2 - 25.
\end{aligned}
$$

Let us find the projection of the origin onto the set $X = \{x \in \mathbb{R}^2 \mid f(x) \le 0\}$ (see
Fig. 1).

Take $x_0 = (5, 1)^T$ as an initial point, $\delta = 0.1$ for the step size, and $\varepsilon = 0.01$
for the stopping criterion (5). Below, the results of algorithm (4) for different values
of the smoothing parameter $\alpha$ (see (6)) are given. Computations were made on the MATLAB
R2018b platform (Table 1).
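The geometry of this example can be checked independently of the method. The sketch below verifies that the initial point and $(2, 0)^T$ lie on the boundary of $X$, and then estimates how far the smoothing alone moves the solution. The closed-form expression is a side calculation under the assumption that, by the symmetry of the set about the axis $x_2 = 0$, the projection onto the smoothed set lies on that axis, where $f_\alpha(x_1, 0) = (x_1 - 5)^2 - 9 + \ln 2 / \alpha$.

```python
import math

def h1(x):
    return (x[0] - 5.0) ** 2 + (x[1] - 4.0) ** 2 - 25.0

def h2(x):
    return (x[0] - 5.0) ** 2 + (x[1] + 4.0) ** 2 - 25.0

def f(x):
    return max(h1(x), h2(x))

x0 = (5.0, 1.0)      # initial point of Example 1
x_true = (2.0, 0.0)  # projection of the origin onto X

def smoothed_projection_x1(alpha):
    # Root of (x1 - 5)^2 - 9 + ln(2)/alpha = 0 closest to the origin
    # (assumes the smoothed projection lies on the axis x2 = 0).
    return 5.0 - math.sqrt(9.0 - math.log(2.0) / alpha)

print("f(x0) =", f(x0), " f(x_true) =", f(x_true))
for alpha in (10, 20, 30):
    gap = smoothed_projection_x1(alpha) - x_true[0]
    print("alpha =", alpha, " shift caused by smoothing =", gap)
```

For $\alpha = 10$ and $\alpha = 20$ this shift is of the same size as the errors reported in Table 1. The reported errors also depend on the stopping tolerance, so an exact match for every $\alpha$ should not be expected.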


Fig. 1 Illustration of set X


Table 1 Results for the function in Example 1. Here $\alpha$ is the smoothing parameter, $k$ is the number
of iterations, $t$ is the time in seconds, $\bar{x}_*$ is the obtained solution, and $x_* = (2, 0)^T$ is the true solution

$\alpha$    $k$       $t$        $\|\bar{x}_* - x_*\|$
10    14360   0.0143   116 · 10^-4
20    436     0.0142   57 · 10^-4
30    448     0.0193   1.8 · 10^-4


Table 2 Results for the function in Example 2. Here $\alpha$ is the smoothing parameter, $k$ is the number
of iterations, $t$ is the time in seconds, $\bar{x}_*$ is the obtained solution, and $x_* = (2, 0, \dots, 0)^T$ is the true
solution

$\alpha$    $k$     $t$        $\|\bar{x}_* - x_*\|$
10    877   0.0802   0.0116
20    874   0.0811   0.0058
30    870   0.0813   0.0014


Example 2 Consider a function $f\colon \mathbb{R}^{1000} \to \mathbb{R}$, $f(x) = \max_{i = 1, 2} h_i(x)$, where

$$
\begin{aligned}
h_1(x) &= (x_1 - 5)^2 + (x_2 - 4)^2 + \sum_{i=3}^{1000} x_i^2 - 25, \\
h_2(x) &= (x_1 - 5)^2 + (x_2 + 4)^2 + \sum_{i=3}^{1000} x_i^2 - 25.
\end{aligned}
$$

Let us find the projection of the origin onto the set $X = \{x \in \mathbb{R}^{1000} \mid f(x) \le 0\}$
(Table 2).

Take $x_0 = (5, 1, 0, \dots, 0)^T$ as an initial point, $\delta = 0.1$ for the step size, and
$\varepsilon = 0.05$ for the stopping criterion (5). The results of algorithm (4) for
different values of the smoothing parameter $\alpha$ (see (6)) are given in Table 2.
Computations were made on the MATLAB R2018b platform.


5 Conclusion 


We demonstrated that Charged Balls Method is quite an effective tool for solving
various computational geometry problems. The idea behind this method is productive
and allows one to build different modifications tailored to specific problems.


Acknowledgments Results in Sects. 3 and 4 were obtained in the Institute for Problems in


Towards Optimal Sampling for Learning
Sparse Approximations in High
Dimensions


Ben Adcock, Juan M. Cardenas, Nick Dexter, and Sebastian Moraga 


Abstract In this chapter, we discuss recent work on learning sparse approximations
to high-dimensional functions from data, where the target functions may be scalar-,
vector- or even Hilbert space-valued. Our main objective is to study how the
sampling strategy affects the sample complexity—that is, the number of samples 
that suffice for accurate and stable recovery—and to use this insight to obtain 
optimal or near-optimal sampling procedures. We consider two settings. First, when 
a target sparse representation is known, in which case we present a near-complete 
answer based on drawing independent random samples from carefully designed 
probability measures. Second, we consider the more challenging scenario when such 
representation is unknown. In this case, while not giving a full answer, we describe a 
general construction of sampling measures that improves over standard Monte Carlo 
sampling. We present examples using algebraic and trigonometric polynomials, and 
for the former, we also introduce a new procedure for function approximation on 
irregular (i.e. nontensorial) domains. The effectiveness of this procedure is shown 
through numerical examples. Finally, we discuss a number of structured sparsity 
models and how they may lead to better approximations. 


1 Introduction 


Learning an accurate approximation to an unknown function from data is a 
fundamental problem at the heart of many key tasks in applied mathematics and 
computer science. This problem is rendered challenging by the famous curse of 
dimensionality. In many relevant applications, the domain of the function is a 
high-dimensional space, and thus standard algorithms (those well suited in lower 
dimensions) often suffer from an exponential blowup in sample complexity (the
number of samples required to obtain an accurate approximation). This is particularly
problematic in many practical settings, since the amount of data available is
often highly limited.

Fortunately, it is well known that functions arising in practice often possess 
low-dimensional structure. Specifically, they admit approximately sparse represen- 
tations, meaning that they can be efficiently approximated using a relatively small 
number s of functions from a particular dictionary. With this in mind, the aim of this 
chapter is to address the following fundamental question: supposing a function has 
an approximately sparse representation, how many samples (of a given type) suffice 
to learn such an approximation from data and how can it be computed? 


1.1 Main Problem 


Let $(D, J, \varrho)$ be a probability space. Here, $D$ is typically a subset of $\mathbb{R}^d$, where
$d \ge 1$ is the dimension of the problem. We consider approximating functions
defined over $D$. In many applications, such a function takes scalar values. However,
other applications call for the approximation of functions that are vector- or function
space-valued. To this end, in this work, we let $\mathcal{V}$ be a separable Hilbert space over
the field $\mathbb{C}$ and consider a function of the form

$$
f \colon D \to \mathcal{V}.
$$

Note that $\mathcal{V}$ may be taken as $(\mathbb{C}, |\cdot|)$ in the case of scalar-valued function
approximation or $(\mathbb{C}^k, \|\cdot\|_2)$ in the case of vector-valued function approximation.
Alternatively, it may be an infinite-dimensional Hilbert space of functions. We discuss
several motivations for studying this case later. Note also that we consider vector
spaces over complex fields. Doing so presents a number of additional challenges
over considering the real case only.

Let $L^2_\varrho(D)$ be the Lebesgue space of complex scalar-valued, square-integrable
functions on $D$. We now consider a known dictionary of functions

$$
\Phi = \{\phi_\iota : \iota \in \mathcal{I}\} \subset L^2_\varrho(D),
$$

which may be finite, countable or uncountable, and we assume that $f$ has an
approximate $s$-sparse representation in $\Phi$. That is to say, there exists a set $S \subset \mathcal{I}$
of size $|S| \le s$ for which

$$
f \approx f_S := \sum_{\iota \in S} c_\iota \phi_\iota, \tag{1}
$$

where the coefficients $c_\iota \in \mathcal{V}$ are elements of the Hilbert space $\mathcal{V}$.

Motivated by the high dimensionality of the domain $D$, our primary focus in
this work is on random sampling schemes. To this end, we assume that there
are probability measures $\mu_1, \dots, \mu_m$ on $D$, we draw $m$ independent samples
$y_1, \dots, y_m$ with $y_i \sim \mu_i$, $i = 1, \dots, m$, and we assume that the data takes the
form

$$
(y_i, f(y_i) + n_i), \quad i = 1, \dots, m, \tag{2}
$$

where $n_i \in \mathcal{V}$ is measurement noise. With this in hand, we may reformulate the
main question stated previously as follows: how should one choose the number of
samples $m$, the sampling measures $\mu_1, \dots, \mu_m$ and the learning procedure so that
an approximation to $f$ yielding an error close to that of the sparse representation $f_S$
can be computed from the data (2)? Furthermore, is this approximation stable to
measurement noise?

Lacking any further insight, the standard random sampling strategy involves
drawing samples in a Monte Carlo fashion from the underlying measure $\varrho$; in other
words, we let $\mu_1 = \dots = \mu_m = \varrho$. We consider this strategy the starting point for
the discussion. To this end, we also consider the related question: what is the sample
complexity of Monte Carlo sampling, and to what extent can this be improved by
changing the sampling measures $\mu_i$?

Note that the focus of this work is on approximations that can be computed
(potentially up to some tolerance) in finite time. When $f$ takes values in an
infinite-dimensional Hilbert space $\mathcal{V}$, this presents an issue. To address it, we assume that $\mathcal{V}$
can be discretized via a finite-dimensional space $\mathcal{V}_h$ (here $h > 0$ is a discretization
parameter) and then proceed to perform computations in $\mathcal{V}_h$, as opposed to $\mathcal{V}$. Thus,
another important question we discuss in this chapter is: what is the effect of this
discretization on the ensuing approximation to $f$?


1.2. Overview 


The purpose of this chapter is to survey a recent body of work that has sought to 
answer these questions. See Sect. 1.4 for a detailed summary of relevant literature. 
We divide our discussion into two main cases. 

First, we consider the case where the target set S in the sparse representation (1) 
is known. This is by far the simpler situation, yet it can indeed occur in certain 
problems arising in practice. For example, S may be obtained by a priori regularity 
estimates on f. Moreover, even though it may not be applicable in general, 
examining this case helps provide insight into what can possibly be achieved in 
the second setting, where S is unknown. 

We provide an almost complete set of answers to the above questions in this
first setting. The approximation is learned through a simple (weighted) least-squares
fit, which is readily shown to provide accurate and stable approximations. We also
obtain a general condition on the sampling measures $\mu_i$, as well as explicit examples
satisfying such a condition, for which only

$$
m \gtrsim s \cdot \log(2s/\epsilon) \tag{3}
$$

samples suffice for recovery, with probability at least $1 - \epsilon$ for some $\epsilon > 0$.
This condition is optimal up to the constant implied by the $\gtrsim$ symbol and the log
factor. As we also discuss, the near-optimal sample complexity bound (3) typically
does not hold in the case of Monte Carlo sampling. We discuss examples where the
corresponding bound for Monte Carlo sampling can be arbitrarily large. Section 3
is devoted to weighted least-squares approximation.
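A weighted least-squares fit of this kind can be sketched in a few lines. The example is purely illustrative and every concrete choice in it is an assumption: $D = [-1, 1]$ with the uniform probability measure, a two-element dictionary of orthonormal Legendre functions, a target that lies exactly in their span, and a sampling density proportional to $1 + y^2$ that was picked for convenience rather than for optimality. The weights are the reciprocal of the sampling density relative to the underlying measure.

```python
import math
import random

def basis(y):
    # Hypothetical dictionary: 1 and sqrt(3)*y, orthonormal w.r.t. dy/2 on [-1, 1].
    return [1.0, math.sqrt(3.0) * y]

def target(y):
    # Illustrative target lying exactly in the span of the dictionary.
    return 1.0 + 2.0 * y

def weight(y):
    # Samples have density (3/8)(1 + y^2); the underlying measure has density 1/2,
    # so the weight is the ratio (1/2) / ((3/8)(1 + y^2)).
    return 4.0 / (3.0 * (1.0 + y * y))

def draw_sample(rng):
    # Rejection sampling from the density proportional to 1 + y^2 on [-1, 1].
    while True:
        y = rng.uniform(-1.0, 1.0)
        if rng.random() * 2.0 <= 1.0 + y * y:
            return y

def weighted_least_squares(ys, values):
    # Solve the 2x2 weighted normal equations G c = b by Cramer's rule.
    g00 = g01 = g11 = b0 = b1 = 0.0
    for y, v in zip(ys, values):
        p0, p1 = basis(y)
        wt = weight(y)
        g00 += wt * p0 * p0
        g01 += wt * p0 * p1
        g11 += wt * p1 * p1
        b0 += wt * p0 * v
        b1 += wt * p1 * v
    det = g00 * g11 - g01 * g01
    return [(g11 * b0 - g01 * b1) / det, (g00 * b1 - g01 * b0) / det]

rng = random.Random(0)
ys = [draw_sample(rng) for _ in range(50)]
coeffs = weighted_least_squares(ys, [target(y) for y in ys])
print(coeffs)
```

Since the data are noiseless and the target lies in the span, the fit should return the coefficients $1$ and $2/\sqrt{3}$ up to rounding, whatever the sample points are.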

Unfortunately, the first case is rather rare in practice. It is more common to
encounter the situation where $S$ is unknown a priori. To overcome this, one may
seek to employ adaptive sampling while building $S$ in an iterative manner, typically
via a greedy scheme. While such procedures can sometimes work well in practice
(especially when $s$ is relatively small), they often lack theoretical guarantees.
Instead, we pursue a different approach using tools from sparse regularization, in
which we seek to promote the sparsity of $f$ in the dictionary $\Phi$ via $\ell^1$-minimization-type
techniques. Analysis of this case can then be performed using tools from
compressed sensing theory.

In this case, we assume that $\Phi$ is a finite set of linearly independent elements.
Let $n = |\Phi|$. Our main result on sample complexity in this case demonstrates that
there exist choices of sampling measures $\mu_1, \dots, \mu_m$ for which

$$
m \gtrsim (b/a) \cdot (\theta^2/a) \cdot s \cdot \left( \log(\mathrm{e} n) \cdot \log^2(\mathrm{e} (b/a)(\theta^2/a) s) + \log(2/\epsilon) \right)
$$

samples suffice for recovery, where $a, b > 0$ are the Riesz basis constants of $\Phi$
(see (45)). Here, $\theta$ is an explicit constant, given by

$$
\theta^2 = \int_D \max_{\iota \in \mathcal{I}} |\phi_\iota(y)|^2 \, \mathrm{d}\varrho(y).
$$

We also present a sample complexity bound for Monte Carlo sampling, which takes
the form

$$
m \gtrsim (b/a) \cdot (\Theta^2/a) \cdot s \cdot \left( \log(\mathrm{e} n) \cdot \log^2(\mathrm{e} (b/a)(\Theta^2/a) s) + \log(2/\epsilon) \right),
$$

where

$$
\Theta^2 = \max_{\iota \in \mathcal{I}} \|\phi_\iota\|^2_{L^\infty_\varrho(D)}.
$$

Notice that $\theta \le \Theta$. Hence the former strategy is always at least as good as Monte
Carlo sampling. We present examples where $\theta = \Theta = 1$ (in which case, Monte
Carlo sampling is sufficient) and where $\theta < \Theta$ (in which case, the former strategy
is strictly better).
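The two constants can be compared numerically for a concrete dictionary. The sketch below uses a hypothetical example that is not taken from the chapter: the first three orthonormal Legendre polynomials on $D = [-1, 1]$ with the uniform probability measure. The integral is approximated by the midpoint rule and the supremum by a maximum over a grid that includes the endpoints.

```python
import math

def dictionary(y):
    # Hypothetical dictionary: orthonormal Legendre polynomials of degree 0, 1, 2
    # with respect to the uniform probability measure dy/2 on [-1, 1].
    return [1.0, math.sqrt(3.0) * y, math.sqrt(5.0) * (3.0 * y * y - 1.0) / 2.0]

def theta_squared(n_cells=20000):
    # Midpoint rule for the integral over D of max_i |phi_i(y)|^2 against dy/2.
    h = 2.0 / n_cells
    total = 0.0
    for j in range(n_cells):
        y = -1.0 + (j + 0.5) * h
        total += max(p * p for p in dictionary(y)) * 0.5 * h
    return total

def big_theta_squared(n_grid=2001):
    # Grid approximation of max_i sup_y |phi_i(y)|^2 (the grid contains y = -1 and y = 1).
    best = 0.0
    for j in range(n_grid):
        y = -1.0 + 2.0 * j / (n_grid - 1)
        best = max(best, max(p * p for p in dictionary(y)))
    return best

t2 = theta_squared()
T2 = big_theta_squared()
print("theta^2 ~", t2, " Theta^2 ~", T2)
```

For this dictionary $\Theta^2 = 5$, attained at the endpoints by the degree-2 polynomial, while $\theta^2$ is smaller because the pointwise maximum is integrated rather than maximised.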
1.3 Additional Contributions


In tandem with the various sample complexity bounds, we also present error
bounds for the learned approximations. These show that such approximations are
accurate, i.e. the error is bounded by the best approximation error $f - f_S$, measured
in some norm, and stable to noise, i.e. the error scales linearly with the noise values
$n_i$. In the Hilbert-valued setting, we also determine stability to discretization error,
in the sense that the error involves an additional term that is proportional to the
error of the orthogonal projection onto $\mathcal{V}_h$.

Several of our examples consider function approximation on tensor product
domains, such as the symmetric hypercube $D = [-1, 1]^d$ in $d$ dimensions. However,
certain practical applications result in approximation problems on irregular
domains. Another contribution of this chapter is to introduce a new approach 
for function approximation on irregular domains via sparse regularization. We 
demonstrate the efficacy of this new approach through both theoretical guarantees 
and numerical examples. 

Finally, we also discuss settings where $f$ admits a structured sparse approximation
in the dictionary $\Phi$. Such representations arise frequently in practice and
can lead to tangible benefits in accuracy. We consider two such models, weighted 
sparsity and lower set sparsity, and briefly describe the extension of the main results 
to these settings. Focusing on the irregular domain case, we also showcase the 
benefits of such structured sparsity models via numerical examples. 


1.4 Related Literature 


This work is motivated in great part by applications arising in parametric models. 
Here, one seeks to understand how the parameters in a physical model (a weather or
climate model, a chemical or biological process, a fluid flow model such as groundwater
flow, a nuclear reactor, an aircraft engine, etc.) affect its output. Parametric
models are ubiquitous in engineering and the physical sciences. Approximating the
input-output map of a parametric model is a problem that lies at the heart of many
key tasks in parametric modelling, such as performing uncertainty quantification,
parameter optimization or solving parametric inverse problems. See [42, 61, 83, 85] 
for detailed introductions to this topic. 

Parametric models are often formulated as (systems of) DEs. In such problems,
the function $f$ is the solution $u$ of a PDE system of the form

$$
\mathcal{L}_x(u, y) = 0, \tag{4}
$$

defined over a physical domain $\Omega$ and subject to suitable boundary conditions,
where $\mathcal{L}_x(\cdot, y)$ denotes a differential operator in the physical variable $x$ which
depends on $y$. Therefore, the solution $u$ is also a function defined over $\Omega \times D$,
and for each fixed $y \in D$, the solution $u(\cdot, y)$ is an element of a function space $\mathcal{V}$.
Note here that $\mathcal{V}$ may be a Hilbert or a Banach space, depending on the particular
form of (4). The typical goal in such settings is then to compute a quantity of interest
(QoI) $Q \colon \mathcal{V} \to \mathbb{R}$ depending on $u$, e.g. the expectation or variance of $u$ with respect
to $y$ at certain points $(x_i)$ or the integral of $u$ with respect to the physical variable $x$
as a function of $y$. Depending on the task at hand, a number of quantities of interest
may be required, and in such scenarios computation of a fast surrogate of the full
parameter-to-solution map $y \mapsto u(\cdot, y) \in \mathcal{V}$ is desirable.

Generally speaking, evaluating $u(\cdot, y)$ (or some QoI of $u$) at a fixed value of $y$
is expensive. This involves either a costly physical experiment or a computationally
intensive numerical simulation to (approximately) solve (4). Hence, the objective is
to approximate $u$, or some QoI, from as few sample values

$$
u(\cdot, y_1), \dots, u(\cdot, y_m)
$$


as possible. There are many different approaches to effect such an approximation, 
many of which seek to exploit low-dimensional structure of the solution u, typically 
in the form of sparsity with respect to a dictionary. Amongst the most popular 
methods are those which use a basis of algebraic polynomials (termed polynomial 
chaos expansions in uncertainty quantification), which are motivated by the fact 
that solutions of many parametric DEs (4) are smooth functions of their parameters. 
But there are also techniques based on multiscale or hierarchical bases, radial basis 
functions, trigonometric polynomials, and various others. Furthermore, there are 
adaptive or learned basis methods, such as, most recently, techniques involving deep 
neural networks. 

The systematic study of least-squares approximation in general finite- 
dimensional subspaces from Monte Carlo samples began with the work of [33], 
with a focus on spaces of algebraic polynomials. Other early works on algebraic 
polynomials include [23, 66, 70]. It was observed that Monte Carlo sampling 
can lead to large sample complexities or poor approximations, which in turn 
led to a series of investigations into the design of improved sampling strategies; 
see [13, 38, 41, 48, 49, 69, 73, 82, 88, 103-105], and the references therein. 
The matter of optimal sampling was theoretically resolved in [49] for specific 
polynomial subspaces and later [30] for general spaces. However, drawing samples 
from the resulting measures may not always be straightforward in practice. The 
measures are also nonadaptive. This led to various further extensions, including 
adaptive strategies [16, 67], more practical approaches based on discrete measures 
[4, 38, 68] and recent work on boosting [38, 47]. For other reviews of this topic, see 
[13, 31, 46, 48]. 

The application of $\ell^1$-minimization for computing sparse polynomial approximations
of functions was first considered in [18, 39, 65, 79, 97]. This led to substantial
amounts of subsequent research, including [7, 9, 10, 14, 19, 26, 26, 43, 45, 52, 55,
57, 63, 74, 76, 78, 81, 86, 90, 91, 94, 96, 98-102]. Specific extensions to weighted
and lower sparsity models were developed in [1-3, 10, 13, 25, 75, 80, 99]. The
generalization to Hilbert-valued functions was considered in [36]. As in the case
of least squares, Monte Carlo sampling can lead to poor sample complexity bounds.
Thus, a series of works considered improved sampling strategies [15, 37, 44, 50, 58,
62, 87, 95]. Weighted and lower set sparsity were developed in a series of works
[2, 8, 25, 80]. For additional reviews of this topic, see [8, 13, 51, 60, 63, 64, 72].


1.5 Outline 


The remainder of this chapter surveys the topic of constructing sparse approximations
to scalar- or Hilbert-valued functions from sample values via least squares
or $\ell^1$-minimization. Our focus is on the question of sampling and, in particular,
whether or not optimal sampling can be achieved. We combine ideas from many of 
the aforementioned works, which are generally specific to polynomial approxima- 
tions, and describe them in the setting of general dictionaries of functions. 

The outline of the remainder of this chapter is as follows. First, in Sect. 2 we 
introduce various preliminary concepts and notation. We then formalize the main 
problem and three main questions and introduce the main examples considered 
later to highlight the main results. Next, in Sect. 3, we consider least-squares 
approximation. We provide definitive answers to all three main questions and 
present several numerical examples. In Sect. 4, we consider $\ell^1$-minimization. We
present a series of theoretical results and then describe the extent to which they 
resolve the three main questions. In Sect. 6, we consider the extension to weighted 
and lower set sparsity models. Finally, we end in Sect. 7 with some conclusions and 
open problems. 


2 Preliminaries 


In this section, we first provide some key notation, and then we describe the setup 
and main problems in further detail. 


2.1 Notation 


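As noted, throughout $(D, J, \varrho)$ is a probability space and $\mathcal{V}$ is a separable Hilbert
space over the field $\mathbb{C}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{V}}$ and the corresponding norm
$\|v\|_{\mathcal{V}} = \sqrt{\langle v, v \rangle_{\mathcal{V}}}$ for $v \in \mathcal{V}$. We write $L^2_\varrho(D; \mathcal{V})$ for the Lebesgue-Bochner space
of functions $f \colon D \to \mathcal{V}$ for which the norm

$$
\|f\|_{L^2_\varrho(D; \mathcal{V})} := \left( \int_D \|f(y)\|_{\mathcal{V}}^2 \, \mathrm{d}\varrho(y) \right)^{1/2} < \infty.
$$

We also write $L^\infty_\varrho(D; \mathcal{V})$ for the Lebesgue-Bochner space of functions $f \colon D \to \mathcal{V}$
for which the norm

$$
\|f\|_{L^\infty_\varrho(D; \mathcal{V})} := \operatorname*{ess\,sup}_{y \in D} \|f(y)\|_{\mathcal{V}} < \infty.
$$

Note that we also denote the classical Lebesgue spaces of complex-valued functions
$f \colon D \to \mathbb{C}$ as $L^2_\varrho(D)$ and $L^\infty_\varrho(D)$. We write $\|\cdot\|_{L^2_\varrho(D)}$ and $\|\cdot\|_{L^\infty_\varrho(D)}$ for their
norms, respectively. These coincide with the Lebesgue-Bochner spaces whenever
$\mathcal{V}$ is taken as $\mathbb{C}$ with the obvious inner product.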

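Given an index set $\mathcal{I}$ that is at most countable, we write $\ell^p(\mathcal{I}; \mathcal{V})$ for the $\ell^p$
space of $\mathcal{V}$-valued sequences $(v_\iota)_{\iota \in \mathcal{I}}$ with finite $\ell^p$-norm, defined by

$$
\|v\|_{\ell^p(\mathcal{I}; \mathcal{V})} = \left( \sum_{\iota \in \mathcal{I}} \|v_\iota\|_{\mathcal{V}}^p \right)^{1/p}, \quad 1 \le p < \infty,
\qquad
\|v\|_{\ell^\infty(\mathcal{I}; \mathcal{V})} = \sup_{\iota \in \mathcal{I}} \|v_\iota\|_{\mathcal{V}}, \quad p = \infty.
$$

When $p = 2$, we also write $\langle \cdot, \cdot \rangle_{\ell^2(\mathcal{I}; \mathcal{V})}$ for its inner product. Note that when $\mathcal{V} =
\mathbb{C}$, we write $\ell^p(\mathcal{I})$ and $\|\cdot\|_{\ell^p(\mathcal{I})}$ or simply $\|\cdot\|_p$ when the choice of $\mathcal{I}$ is clear.
Likewise, for $p = 2$ and $\mathcal{V} = \mathbb{C}$, we write $\langle \cdot, \cdot \rangle$ for the $\ell^2$-inner product on $\ell^2(\mathcal{I})$.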
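For a finite index set these norms reduce to a short computation. The sketch below is an illustration only, assuming a finite-dimensional space of complex vectors with the Euclidean norm for the values and a hypothetical three-term sequence.

```python
import math

def v_norm(v):
    # Norm of one element of the value space, taken here as C^k with the
    # Euclidean norm (a hypothetical finite-dimensional choice).
    return math.sqrt(sum(abs(c) ** 2 for c in v))

def lp_norm(seq, p):
    # l^p norm of a finite sequence of vectors: first take the norm of each
    # term, then the usual l^p norm of the resulting nonnegative numbers.
    norms = [v_norm(v) for v in seq]
    if p == math.inf:
        return max(norms)
    return sum(n ** p for n in norms) ** (1.0 / p)

# Hypothetical sequence with three terms in C^2; their norms are 5, 0 and 10.
seq = [(3, 4), (0, 0), (6j, 8)]

print(lp_norm(seq, 1), lp_norm(seq, 2), lp_norm(seq, math.inf))
```

Only the norms of the individual terms enter, so a sequence of vectors is measured exactly like the scalar sequence of its termwise norms.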

As discussed above, the space $\mathcal{V}$ may be infinite dimensional. Hence, performing
computations in $\mathcal{V}$ directly is often not possible. To this end, we introduce a
finite-dimensional discretization of $\mathcal{V}$, denoted by $\mathcal{V}_h$, where $h > 0$ is a discretization
parameter. We assume that $\mathcal{V}_h \subseteq \mathcal{V}$ is a subspace of $\mathcal{V}$ and write

$$
\mathcal{P}_h \colon \mathcal{V} \to \mathcal{V}_h
$$

for the orthogonal projection onto this subspace. Furthermore, given $f \in L^2_\varrho(D; \mathcal{V})$,
we write $\mathcal{P}_h f \in L^2_\varrho(D; \mathcal{V}_h)$ for the almost everywhere defined function given by

$$
(\mathcal{P}_h f)(y) = \mathcal{P}_h(f(y)), \quad y \in D.
$$

When necessary, we also employ a not necessarily orthonormal basis of $\mathcal{V}_h$. We
write $\{\varphi_i\}_{i=1}^{k}$ for such a basis, where $k = \dim(\mathcal{V}_h)$.
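When the basis of the discretization space is not orthonormal, the orthogonal projection can be computed from the Gram matrix of the basis. The sketch below is a minimal illustration under simplifying assumptions: a real three-dimensional space stands in for the Hilbert space, and a hypothetical pair of vectors spans the two-dimensional subspace.

```python
def dot(u, v):
    # Real Euclidean inner product (a stand-in for the Hilbert space inner product).
    return sum(a * b for a, b in zip(u, v))

def project_onto_span(v, b1, b2):
    # Orthogonal projection of v onto span{b1, b2} for a basis that need not be
    # orthonormal: solve the 2x2 Gram system G c = r, then form c1*b1 + c2*b2.
    g11, g12, g22 = dot(b1, b1), dot(b1, b2), dot(b2, b2)
    r1, r2 = dot(v, b1), dot(v, b2)
    det = g11 * g22 - g12 * g12
    c1 = (g22 * r1 - g12 * r2) / det
    c2 = (g11 * r2 - g12 * r1) / det
    return [c1 * a + c2 * b for a, b in zip(b1, b2)]

# Hypothetical data: a non-orthogonal basis of the plane x3 = 0 in R^3.
b1 = [1.0, 0.0, 0.0]
b2 = [1.0, 1.0, 0.0]
v = [2.0, 3.0, 4.0]

p = project_onto_span(v, b1, b2)
residual = [vi - pi for vi, pi in zip(v, p)]
print(p, residual)
```

The residual is orthogonal to both basis vectors, which is the defining property of the orthogonal projection.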

Finally, we require a few additional pieces of notation. For convenience, we write
$[n] := \{1, \dots, n\}$ for $n \in \mathbb{N}$. We also use the notation $A \lesssim B$ to mean that there
exists a numerical constant $c > 0$ such that $A \le cB$, and likewise for $A \gtrsim B$.
Furthermore, we write $A \lesssim_x B$ if $A \le c_x B$ for some constant $c_x > 0$ depending on
a variable $x$, and likewise for $A \gtrsim_x B$.


2.2 Problem and Key Questions


As above, we let $\Phi = \{\phi_\iota : \iota \in \mathcal{I}\} \subset L^2_\varrho(D)$ be a dictionary and $f \in L^2_\varrho(D; \mathcal{V})$ be
the function we seek to learn. We consider $s$-sparse representations of $f$ of the form

$$
f_S := \sum_{\iota \in S} c_\iota \phi_\iota, \quad c_\iota \in \mathcal{V}, \tag{5}
$$

where $S \subseteq \mathcal{I}$, $|S| \le s$, is a subset of at most $s$ indices. We now formalize the two main
settings we consider in this work:


Problem 1 (Sparsity in a Known Subset) The function $f$ has an approximate $s$-sparse
representation of the form (5) for some known set $S$.


Problem 2 (Sparsity in an Unknown Subset) The function $f$ has an approximate
$s$-sparse representation of the form (5) for some unknown set $S$.


As discussed above, we consider sample points drawn randomly according to
probability measures $\mu_1, \dots, \mu_m$. We term these the sampling measures. We make
the following assumption:


Assumption 1 (Absolute Continuity and Positivity) The additive mixture

$$
\mu := \frac{1}{m} \sum_{i=1}^{m} \mu_i
$$

is absolutely continuous with respect to $\varrho$, and moreover its Radon-Nikodym
derivative is strictly positive almost everywhere on $\operatorname{supp}(\varrho)$.


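This means that we can write

$$
\frac{1}{m} \sum_{i=1}^{m} \mathrm{d}\mu_i(y) = \frac{1}{w(y)} \, \mathrm{d}\varrho(y), \tag{6}
$$

where $w \colon D \to \mathbb{R}$ is finite almost everywhere on $\operatorname{supp}(\varrho)$. We refer to $w$ as the
weight function. Note that it satisfies

$$
\int_D \frac{1}{w(y)} \, \mathrm{d}\varrho(y) = 1. \tag{7}
$$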
Given such sampling measures, we now draw samples y; ~ j,i = 1,...,m, 


independently from these measures and consider noisy data of the form 


(i, fi) +n) € Dx Vn, i=1,...,m. (8) 

Here, the $n_i \in \mathcal{V}$ are terms that capture the measurement error. We focus on the case
where these terms are small in norm, but we do not assume they follow a specific
distribution (e.g. Gaussian noise in the scalar- or vector-valued case). Note that we
assume the measurements $f(y_i) + n_i$ are elements of the finite-dimensional space
$\mathcal{V}_h$. Our motivation for doing so is the following. Since $f$ is $\mathcal{V}$-valued, the noiseless
sample $f(y_i)$ is an element of the (potentially) infinite-dimensional Hilbert space $\mathcal{V}$.
In general, this quantity cannot be stored, let alone used as the input to an algorithm
for learning an approximation to $f$. Hence, we assume that the measurements are
elements of the finite-dimensional subspace $\mathcal{V}_h$, which means they can be both
stored (for example, by storing their coefficients with respect to the basis $\{\varphi_i\}_{i=1}^{k}$
for $\mathcal{V}_h$) and used as input to a learning algorithm. Note that the quantity $n_i \in \mathcal{V}$
accounts for both the discretization error in approximating the true sample $f(y_i)$
by an element of $\mathcal{V}_h$ and any other errors that arise in the measurement process
(e.g. noise, numerical error and so forth). Furthermore, we do not specify how
measurements are processed to give elements of $\mathcal{V}_h$; we simply assume that this
process yields an error that can be captured by the generic noise term $n_i$. In other
words, the model (8) is very general, and therefore sufficient for many applications.
We now formalize the three main questions considered in this work: 


Question 1 Suppose $f$ satisfies either Problem 1 or Problem 2. How does one learn
an approximation to $f$ from the data (8) that is accurate, i.e. the approximation
error is bounded by the errors $f - f_S$ and, in the case of Hilbert-valued functions,
$f - \mathcal{P}_h(f)$, measured in suitable norms, and stable, i.e. the error depends
moderately on the noise values $n_i$, measured in a suitable norm?


Question 2 In the setting of either Problem 1 or Problem 2, how many samples m 
are sufficient to obtain such an approximation, and how does the choice of sampling 
measures affect this bound? 


Question 3 In the setting of either Problem 1 or Problem 2, does Monte Carlo
sampling lead to near-optimal sample complexity—i.e. linear in s, up to constants 
and log factors? If not, is there a near-optimal choice of sampling measures? 


2.3 Examples


We now introduce the main examples considered in this work. 


Example 1 (Trigonometric Polynomial Approximation on the d-Torus) Let $d \geq 1$,
let $D = \mathbb{T}^d$ be the unit torus in $d$ dimensions and let $\mathrm{d}\varrho(y) = \mathrm{d}y$ be the uniform
measure. In this example, we set $\mathcal{I} = \mathbb{Z}^d$ and consider the set of functions

\[ \phi_\iota(y) = \exp(2\pi \mathrm{i} \, \iota \cdot y), \quad \iota = (\iota_1, \ldots, \iota_d) \in \mathcal{I}, \quad y = (y_1, \ldots, y_d) \in D. \]

Observe that the dictionary $\Phi = \{\phi_\iota : \iota \in \mathcal{I}\}$ forms an orthonormal basis of $L^2_\varrho(D)$.

Trigonometric polynomial approximation of smooth and periodic functions in
high dimensions is a classical topic [35, 77, 89]. The nature of the Fourier basis 
makes it a relatively straightforward case to study, and as we see later, this also 
yields clear answers to Questions 1-3. 

Unfortunately, many problems—in particular parametric model problems—do 
not involve periodic functions. Since such functions are often smooth, however, this 
motivates the study of algebraic polynomial approximations. 


Example 2 (Algebraic Polynomial Approximation in the Symmetric Hypercube) Let
$D = [-1, 1]^d$ be the symmetric hypercube in $d \geq 1$ dimensions of side length 2
and $\mathrm{d}\varrho(y) = 2^{-d} \, \mathrm{d}y$ be the uniform measure. We set $\mathcal{I} = \mathbb{N}_0^d$ and consider

\[ \phi_\iota(y) = p_{\iota_1}(y_1) \cdots p_{\iota_d}(y_d), \quad \iota = (\iota_1, \ldots, \iota_d) \in \mathcal{I}, \quad y = (y_1, \ldots, y_d) \in D, \tag{9} \]

where, on the right-hand side, $p_\iota$ denotes the one-dimensional Legendre polynomial
of degree $\iota \in \mathbb{N}_0$, normalized with respect to the one-dimensional uniform measure.
Note that $p_\iota(y) = \sqrt{2\iota + 1} \, P_\iota(y)$, where $P_\iota(y)$ is the classical Legendre polynomial
with normalization $P_\iota(1) = 1$.
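
As a small illustration of the normalization in Example 2, the following sketch evaluates the one-dimensional normalized Legendre polynomials through the standard three-term (Bonnet) recurrence and checks orthonormality with respect to the uniform measure $\mathrm{d}y/2$ on $[-1, 1]$. The function names `legendre_normalized` and `gram_entry` are hypothetical, and the midpoint quadrature rule is an assumption made only to keep the example dependency-free; it is not part of the chapter's method.

```python
import math


def legendre_normalized(n_max, y):
    """Return [p_0(y), ..., p_{n_max}(y)] with p_n = sqrt(2n + 1) * P_n."""
    P = [1.0, y]  # classical Legendre polynomials P_0 and P_1
    for n in range(1, n_max):
        # Bonnet recurrence: (n + 1) P_{n+1} = (2n + 1) y P_n - n P_{n-1}
        P.append(((2 * n + 1) * y * P[n] - n * P[n - 1]) / (n + 1))
    return [math.sqrt(2 * n + 1) * P[n] for n in range(n_max + 1)]


def gram_entry(i, j, n_quad=2000):
    """Midpoint-rule approximation of the integral of p_i * p_j against dy/2 on [-1, 1]."""
    h = 2.0 / n_quad
    total = 0.0
    for q in range(n_quad):
        y = -1.0 + (q + 0.5) * h
        p = legendre_normalized(max(i, j), y)
        total += p[i] * p[j] * h / 2.0
    return total
```

Under these assumptions, `gram_entry(i, i)` should be close to one and `gram_entry(i, j)` close to zero for `i != j`, while `legendre_normalized(n, 1.0)[n]` equals $\sqrt{2n + 1}$, the value used in the proof of Lemma 1.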


As observed, algebraic polynomial approximation is used widely in parametric
model problems. This is motivated by the observation that the solutions of many classes of
parametric differential equations are holomorphic (analytic) functions of their
parameters (see [13, 22, 24, 28, 32, 53, 56, 92], and the references therein). This
means the polynomial coefficients decay rapidly, yielding approximately sparse
representations in a given polynomial basis. Note that Problem 1 is naturally
motivated by such problems. For certain classes of parametric differential equations, 
one can use a priori analysis to determine coefficient estimates, and using these 
obtain a candidate set S. However, this is not feasible for more complicated 
parametric differential equations or problems where f is given as a black box. In 
this setting, we resort to Problem 2. 

Note that it is also common to consider other systems of polynomials. Common
examples include Chebyshev polynomials on $[-1, 1]^d$, Laguerre or Hermite
polynomials on $[0, \infty)^d$ and $\mathbb{R}^d$, respectively, or nonorthogonal polynomials such
as Taylor polynomials. For succinctness, we consider Legendre polynomials only,
although our analysis readily extends to more general settings.

Many polynomial approximation problems are naturally formulated on compact 
hyperrectangles. Using a change of variables, these can all be reduced to the setting 
of Example 2. In parametric models, this is inspired by the notion that the parameters 
are independent, with each one varying between a finite upper and lower value. 
However, this assumption can fail in practice. In parametric models, for example, 
there may often be dependencies between the parameters [40, 59, 61, 84]. This leads 
to polynomial approximation problems where the domain D, while still compact, 
is no longer a hypercube but an irregular-shaped domain. This poses a number of 
challenges, which we shall review later in this chapter. It motivates our third and
final example: 


Example 3 (Polynomial Approximation on General Compact Domains) Let $D \subset \mathbb{R}^d$
be a measurable set with nonzero measure. We assume without loss of generality
that $D \subseteq [-1, 1]^d$ is contained in the symmetric, $d$-dimensional hypercube with
side length 2. Following a well-known approach, studied in detail in [6], we then
construct a polynomial dictionary by restricting the orthonormal basis of Example 2
to $D$. For consistency of notation, we continue to denote this dictionary as $\Phi = \{\phi_\iota : \iota \in \mathcal{I}\}$.
We also let $\varrho$ denote the uniform measure on $D$, i.e. $\mathrm{d}\varrho(y) = |D|^{-1} \, \mathrm{d}y$. An
important observation in this case is that, in contrast to the previous two examples,
this dictionary is not a basis of $L^2_\varrho(D)$. Rather, it forms a frame [27]. On the other
hand, every finite set of elements from $\Phi$ is linearly independent (indeed, no nontrivial finite
linear combination of polynomials can vanish on a set of nonzero measure) and is
therefore a Riesz basis for its span. These two observations will be particularly
important later when we consider $\ell^1$-minimization techniques in the setting of
Problem 2.


2.4 Multi-Index Sets 


Notice that the dictionary $\Phi$ in Examples 1, 2 or 3 is indexed over a multi-index
$\iota \in \mathcal{I} = \mathbb{Z}^d$ or $\iota \in \mathcal{I} = \mathbb{N}_0^d$. It is useful to define a number of standard choices
for finite subsets of $\mathcal{I}$. In the case of Problem 1, such an index set could be used as
a potential choice for $S$. In Problem 2, by contrast, we see later that it is important to
truncate the infinite set of multi-indices $\mathbb{N}_0^d$ or $\mathbb{Z}^d$ to some finite but large subset to
which we expect the indices of the sparse representation to belong.

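Several standard subsets are the tensor product index set

\[ \mathcal{I}^{\mathrm{TP}}_t = \left\{ \iota = (\iota_k)_{k=1}^{d} \in \mathbb{N}_0^d : \max_{k=1,\ldots,d} \iota_k \leq t \right\} \tag{10} \]

of order $t \in \mathbb{N}_0$, the total degree index set

\[ \mathcal{I}^{\mathrm{TD}}_t = \left\{ \iota = (\iota_k)_{k=1}^{d} \in \mathbb{N}_0^d : \sum_{k=1}^{d} \iota_k \leq t \right\} \tag{11} \]

of order $t$, and the hyperbolic cross index set

\[ \mathcal{I}^{\mathrm{HC}}_t = \left\{ \iota = (\iota_k)_{k=1}^{d} \in \mathbb{N}_0^d : \prod_{k=1}^{d} (\iota_k + 1) \leq t + 1 \right\} \tag{12} \]

of order $t$. Note that these index sets are subsets of $\mathbb{N}_0^d$ and therefore suitable for
Examples 2 and 3. We define analogous subsets of $\mathbb{Z}^d$ for Example 1 simply by
replacing $\iota_k$ by its absolute value $|\iota_k|$ in (10)-(12).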
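
The three index sets above can be enumerated directly from their definitions. The following is a minimal sketch, assuming the order parameter is a nonnegative integer `t` and using brute-force filtering of the tensor product grid; the function names are hypothetical, and the enumeration costs $(t+1)^d$ operations, so it is only meant for small `d` and `t`.

```python
from itertools import product
from math import prod


def tensor_product_set(d, t):
    """All multi-indices in N_0^d with max_k iota_k <= t, as in (10)."""
    return list(product(range(t + 1), repeat=d))


def total_degree_set(d, t):
    """All multi-indices in N_0^d with sum_k iota_k <= t, as in (11)."""
    return [iota for iota in product(range(t + 1), repeat=d) if sum(iota) <= t]


def hyperbolic_cross_set(d, t):
    """All multi-indices in N_0^d with prod_k (iota_k + 1) <= t + 1, as in (12)."""
    return [iota for iota in product(range(t + 1), repeat=d)
            if prod(k + 1 for k in iota) <= t + 1]
```

Comparing `len(...)` for the three sets at the same `d` and `t` gives a quick impression of how much slower the hyperbolic cross grows than the tensor product set.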

The choices (10)-(12) are commonly used in low to moderate dimensional 
problems when selecting the index set S in the setting of Problem 1. However, 
their respective cardinalities grow rapidly with dimension; this is in particular true 
of (10), whose cardinality is $(t+1)^d$. This makes their applicability limited in higher
dimensions, as, for a fixed maximum cardinality s, it may be impossible to achieve 
high orders, which are generally necessary and correspond to better accuracy. 
Indeed, in higher dimensions, it is often important to incorporate anisotropy into 
the index set S to take into account different rates of variation of the function in 
different coordinate directions. By contrast, the index sets (10)-(12) are isotropic; 
indices in S remain in S when their entries are permuted. While it is possible to 
define anisotropic versions of each of these index sets (see, for example, [17]), the 
challenge becomes to set the anisotropy parameters in an a priori manner without 
knowledge of the underlying function f. Instead, we adopt the setting of Problem 2 
and suppose f has a sparse representation in some unknown index set S contained 
within a larger but finite index set of the above form—the goal then being to compute 
an approximation achieving a similar error as that of the sparse representation, 
without necessarily computing S itself. 


3 Sparse Approximation via (Weighted) Least Squares 


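We first suppose that Problem 1 holds and also that $m \geq s$. Let $S \subset \mathcal{I}$, $|S| \leq s$, be
the corresponding subset, and define the resulting subspace

\[ \mathcal{P}_{S;\mathcal{V}} = \left\{ \sum_{\iota \in S} c_\iota \phi_\iota : c_\iota \in \mathcal{V} \right\} \subset L^2_\varrho(D; \mathcal{V}). \]

Note that if $\mathcal{V} = \mathbb{C}$, we simply write $\mathcal{P}_S$ for the subspace of complex-valued functions
$\mathcal{P}_S = \mathcal{P}_{S;\mathbb{C}} = \{ \sum_{\iota \in S} c_\iota \phi_\iota : c_\iota \in \mathbb{C} \} \subset L^2_\varrho(D)$. Next, we recall the discretized
subspace $\mathcal{V}_h$ of $\mathcal{V}$ and the noisy samples (8). With this in hand, we follow an
approach similar to that of [30] (which considers only the real scalar-valued case) and define the
weighted least-squares approximation to $f$ as

\[ \hat{f} \in \operatorname*{argmin}_{p \in \mathcal{P}_{S;\mathcal{V}_h}} \left\{ \frac{1}{m} \sum_{i=1}^{m} w(y_i) \left\| f(y_i) + n_i - p(y_i) \right\|_{\mathcal{V}}^2 \right\}. \tag{13} \]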


Here, $w$ is the weight function specified in (6). Notice that we form an approximation
in the subspace $\mathcal{P}_{S;\mathcal{V}_h}$, as opposed to $\mathcal{P}_{S;\mathcal{V}}$, since we generally cannot perform
computations over the infinite-dimensional Hilbert space $\mathcal{V}$. In the scalar-valued
case, we simply have $\mathcal{V}_h = \mathcal{V} = (\mathbb{C}, |\cdot|)$.

3.1 Computation of the Least-Squares Approximation 


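We first describe the computation of the approximation (13). Since any $p \in \mathcal{P}_{S;\mathcal{V}_h}$
can be expressed as $p = \sum_{\iota \in S} c_\iota \phi_\iota$ with $c_\iota \in \mathcal{V}_h$, we can rewrite (13) as

\[ \hat{f} = \sum_{\iota \in S} \hat{c}_\iota \phi_\iota, \qquad \hat{c} = (\hat{c}_{\iota_j})_{j=1}^{s} \in \operatorname*{argmin}_{c \in \mathcal{V}_h^s} \| A c - v \|_{\ell^2([m];\mathcal{V})}, \tag{14} \]

where

\[ A = \frac{1}{\sqrt{m}} \left( \sqrt{w(y_i)} \, \phi_{\iota_j}(y_i) \right)_{i \in [m], j \in [s]} \in \mathbb{C}^{m \times s}, \qquad v = \frac{1}{\sqrt{m}} \left( \sqrt{w(y_i)} \, (f(y_i) + n_i) \right)_{i \in [m]} \in \mathcal{V}_h^m, \tag{15} \]

and $\{\iota_1, \ldots, \iota_s\}$ is an enumeration of the indices in $S$. Note that we consider $A$ both
as an $m \times s$ matrix and as a mapping $\mathcal{V}^s \to \mathcal{V}^m$ defined in the obvious way, i.e.
$A c = ( \sum_{j \in [s]} A_{ij} c_j )_{i \in [m]} \in \mathcal{V}^m$ for $c = (c_j)_{j \in [s]} \in \mathcal{V}^s$. In the case of scalar-valued
function approximation (i.e. $\mathcal{V} = \mathcal{V}_h = \mathbb{C}$), the problem (14) is a standard
algebraic least-squares problem.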
Moreover, in the general Hilbert-valued case, it is a straightforward exercise to
show that a solution of (14) is given by

\[ \hat{c} = A^{\dagger} v. \tag{16} \]

Here, $A^{\dagger}$ is the pseudoinverse of $A$, or more precisely, its extension in the above
manner to a mapping $\mathcal{V}^m \to \mathcal{V}^s$. In particular, if $A$ is full rank, then this is the
unique solution of (14). Now recall the basis $\{\varphi_i\}_{i=1}^{k}$ for $\mathcal{V}_h$. Observe that we can
write the $i$th component $v_i$ of the $\mathcal{V}_h$-valued vector $v$ defined above as

\[ v_i = \sum_{j \in [k]} V_{ij} \varphi_j, \quad i \in [m], \]

for scalar coefficients $V_{ij} \in \mathbb{C}$. Likewise, we can also write

\[ \hat{c} = (\hat{c}_i)_{i \in [s]}, \qquad \hat{c}_i = \sum_{j \in [k]} \hat{C}_{ij} \varphi_j, \quad i \in [s], \]

for scalar coefficients $\hat{C}_{ij} \in \mathbb{C}$, so that $\hat{f}$ can be expressed as

\[ \hat{f} = \sum_{i \in [s]} \sum_{j \in [k]} \hat{C}_{ij} \, \phi_{\iota_i} \, \varphi_j. \]

Letting $\hat{C} = (\hat{C}_{ij})_{i \in [s], j \in [k]}$ and $V = (V_{ij})_{i \in [m], j \in [k]}$ and using (16), we see that

\[ \hat{C} = A^{\dagger} V. \]


Hence, the coefficients $\hat{C}$ can be computed by first computing the pseudoinverse
$A^{\dagger}$ and then performing the above matrix-matrix multiplication, for a total
of $\mathcal{O}(m s^2 + m s k)$ floating point operations. Alternatively, one could solve $k$
standard algebraic least-squares problems, one for each column of $V$. If conjugate
gradients are used, for example, the cost of obtaining a residual error of size $\eta$ is
$\mathcal{O}(\mathrm{cond}(A) \, m s k \log(\eta^{-1}))$, where $\mathrm{cond}(A)$ is the condition number of $A$. This may
be more efficient in the case where $k \leq s$, in particular, the scalar-valued case,
where $k = 1$.
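
The matrix form $\hat{C} = A^{\dagger} V$ translates almost line by line into code. The following is a minimal sketch using NumPy, assuming the dictionary values $\phi_{\iota_j}(y_i)$, the weights $w(y_i)$ and the basis coefficients of the noisy samples are already available as arrays; the function name `weighted_least_squares` and the argument names are hypothetical. It calls `numpy.linalg.lstsq`, which returns the minimum-norm least-squares solution for all $k$ right-hand sides at once, rather than forming the pseudoinverse explicitly.

```python
import numpy as np


def weighted_least_squares(Phi, w, F):
    """Solve the weighted least-squares problem (14) in coefficient form.

    Phi : (m, s) array with entries phi_{iota_j}(y_i)
    w   : (m,) array with the weights w(y_i) > 0
    F   : (m, k) array; row i holds the coefficients of f(y_i) + n_i
          in the basis of V_h (k = 1 in the scalar-valued case)
    Returns the (s, k) coefficient matrix C_hat.
    """
    m = Phi.shape[0]
    scale = np.sqrt(w / m)[:, None]  # sqrt(w(y_i) / m), one factor per row
    A = scale * Phi                  # matrix A from (15)
    V = scale * F                    # coefficient matrix of the vector v from (15)
    C_hat, *_ = np.linalg.lstsq(A, V, rcond=None)
    return C_hat
```

When `A` has full column rank, the result coincides with `np.linalg.pinv(A) @ V`; solving all columns of `V` in one call is a design choice made here for brevity and is not prescribed by the text.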


3.2 Accuracy, Stability and Sample Complexity 


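In this and the next several subsections, we investigate Questions 1–3. We commence
with Question 1. The accuracy and stability of the approximation (13) are
governed by the existence of a norm equivalence over $\mathcal{P}_S$. Specifically, we assume
that

\[ \alpha \| p \|^2_{L^2_\varrho(D)} \leq \frac{1}{m} \sum_{i=1}^{m} w(y_i) |p(y_i)|^2 \leq \beta \| p \|^2_{L^2_\varrho(D)}, \quad \forall p \in \mathcal{P}_S, \tag{17} \]

for constants $0 < \alpha \leq \beta < \infty$. In other words, the functional
$p \mapsto \sqrt{\frac{1}{m} \sum_{i=1}^{m} w(y_i) |p(y_i)|^2}$ is an equivalent norm over $\mathcal{P}_S$ to the $L^2_\varrho$-norm. We remark
also that (17) is a condition for the space $\mathcal{P}_S = \mathcal{P}_{S;\mathbb{C}}$ consisting of scalar-valued
functions. As the next theorem shows, however, such a condition also determines
accuracy and stability for the approximation of Hilbert-valued functions in the space
$\mathcal{P}_{S;\mathcal{V}_h}$. With this in hand, we now also define the discrete semi-inner product

\[ \langle g, h \rangle_{\mathrm{disc}} = \frac{1}{m} \sum_{i=1}^{m} w(y_i) \langle g(y_i), h(y_i) \rangle_{\mathcal{V}}, \quad g, h \in L^2_\varrho(D; \mathcal{V}), \]

and the corresponding discrete semi-norm $\| g \|_{\mathrm{disc}} = \sqrt{\langle g, g \rangle_{\mathrm{disc}}}$, $g \in L^2_\varrho(D; \mathcal{V})$.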
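
Theorem 2 (Accuracy and Stability of Weighted Least Squares) Let $f \in L^2_\varrho(D; \mathcal{V})$,
$0 < \alpha \leq \beta < \infty$, $\{y_i\}_{i=1}^{m} \subseteq D$, $w : D \to \mathbb{R}$ be such that $w(y_i)$ is
well defined for all $i$, and suppose that (17) holds. Then the approximation $\hat{f}$ in (13)
is unique and satisfies

\[ \| f - \hat{f} \|_{L^2_\varrho(D;\mathcal{V})} \leq \inf_{p \in \mathcal{P}_{S;\mathcal{V}}} \left\{ \| f - p \|_{L^2_\varrho(D;\mathcal{V})} + \frac{1}{\sqrt{\alpha}} \| f - p \|_{\mathrm{disc}} \right\} + \| f - \mathcal{P}_h f \|_{L^2_\varrho(D;\mathcal{V})} + \frac{1}{\sqrt{\alpha}} \| e \|_{\ell^2([m];\mathcal{V})}, \]

where $e = \frac{1}{\sqrt{m}} \left( \sqrt{w(y_i)} \, n_i \right)_{i=1}^{m} \in \mathcal{V}^m$.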


This result (see Sect. 3.7 for its proof) asserts that the error for the learned
approximation $\hat{f}$ splits into three quantities: first, a best approximation error term in
the subspace $\mathcal{P}_{S;\mathcal{V}}$; second, a space discretization error, which accounts for the fact
that the least-squares problem is formulated over $\mathcal{V}_h$ as opposed to $\mathcal{V}$ and is equal
to the projection error $f - \mathcal{P}_h(f)$; and third, a term depending on the measurement
noise values $n_i$. Note that this theorem does not require the points $\{y_i\}_{i=1}^{m}$ to be
random. It holds for any fixed set of sample points whenever (17) also holds.


Remark 1 This result has several disadvantages. First, the noise terms $n_i$ are
multiplied by the weight factors $\sqrt{w(y_i)}$, meaning that noise terms corresponding
to large values of $w$ are weighted more heavily. Second, the best approximation
error mixes the $L^2_\varrho(D; \mathcal{V})$-norm (which is the norm in which the error $f - \hat{f}$ is
measured) with the discrete norm $\| f - p \|_{\mathrm{disc}}$. When the sample points $y_i$ are
random variables (as they will be below), one can use this fact to slightly modify
the approximation $\hat{f}$ in a way in which error bounds involving only $\| f - p \|_{L^2_\varrho(D;\mathcal{V})}$
can be obtained. We omit the details. See [30, 31, 33] for further information in the
scalar-valued case.


Remark 2 The reader will notice that Theorem 2 does not involve the upper constant
$\beta$ in (17). While not strictly needed for this theorem, this constant plays a role in
the computation of the least-squares approximation. Indeed, it is straightforward
to show that the condition number $\mathrm{cond}(A)$ is bounded by $\sqrt{\beta/\alpha}$ whenever
$\{\phi_\iota : \iota \in S\}$ forms an orthonormal basis for $\mathcal{P}_S$. Hence, when the ratio $\beta/\alpha$ is small,
the least-squares system can be solved more efficiently (when employing conjugate
gradients), and its output is less affected by floating point errors.

This property is relevant to Examples 1 and 2, since they involve orthonormal
bases. On the other hand, the least-squares matrix $A$ will be poorly conditioned
whenever the system $\{\phi_\iota : \iota \in S\}$ is nearly linearly dependent. This occurs notably
in Example 3 [6]. Perhaps counterintuitively, this does not necessarily lead to
substantial errors in the resulting least-squares approximation. In fact, whenever
the infinite system of functions forms a frame (as it does in Example 3), this
property endows the problem with sufficient structure to ensure accurate and stable
(regularized) least-squares approximations; see [6] for further discussion.

We now progress to the matter of sample complexity, which will lead to answers
to Questions 2 and 3. As shown in [30] (see also [4]), sample complexity of the least-squares
scheme is determined by the existence of a so-called weighted Nikolskii-type
inequality over $\mathcal{P}_S$. Specifically, let $\mathcal{N}(\mathcal{P}_S, w)$ be the smallest constant such that

\[ \| \sqrt{w} \, p \|_{L^\infty_\varrho(D)} \leq \mathcal{N}(\mathcal{P}_S, w) \, \| p \|_{L^2_\varrho(D)}, \quad \forall p \in \mathcal{P}_S. \tag{18} \]

Again, we observe that this inequality is formulated for the space $\mathcal{P}_S = \mathcal{P}_{S;\mathbb{C}}$ of
scalar-valued functions. We remark also that $\mathcal{N}(\mathcal{P}_S, w)$ is related to the Christoffel
function of the subspace $\mathcal{P}_S$. Specifically,

\[ \mathcal{N}(\mathcal{P}_S, w) = \left\| \sqrt{w(\cdot) \, \mathcal{K}(\mathcal{P}_S)(\cdot)} \right\|_{L^\infty_\varrho(D)}, \tag{19} \]

where $\mathcal{K}(\mathcal{P}_S)$ is the reciprocal of the Christoffel function of $\mathcal{P}_S$. Let $\{\upsilon_\iota\}_{\iota \in S} \subset L^2_\varrho(D)$
be any orthonormal basis for $\mathcal{P}_S$. Then, this function has the explicit
expression

\[ \mathcal{K}(\mathcal{P}_S)(y) = \sum_{\iota \in S} |\upsilon_\iota(y)|^2. \tag{20} \]

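Theorem 3 (Sample Complexity of Weighted Least Squares) Let $0 < \epsilon < 1$,
$0 < \delta < 1$, $\mu_1, \ldots, \mu_m$ be probability measures on $D$ satisfying Assumption 1 and
$y_1, \ldots, y_m$ be independent with $y_i \sim \mu_i$ for $i = 1, \ldots, m$. Suppose that

\[ m \geq c_\delta \cdot (\mathcal{N}(\mathcal{P}_S, w))^2 \cdot \log(2s/\epsilon), \qquad c_\delta = \left( (1 - \delta) \log(1 - \delta) + \delta \right)^{-1}, \tag{21} \]

where $w$ is the weight function specified in (6). Then, (17) holds with $1 - \delta \leq \alpha \leq
\beta \leq 1 + \delta$, with probability at least $1 - \epsilon$.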
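
To get a feel for the size of the bound (21), the following sketch evaluates the constant $c_\delta$ and the resulting lower bound on $m$. The function names and the default values `delta=0.5` and `eps=0.01` are illustrative assumptions, not choices made in the text; the squared Nikolskii constant is passed in as a number and must be known or estimated separately.

```python
import math


def c_delta(delta):
    """Constant c_delta = ((1 - delta) * log(1 - delta) + delta)^(-1) from (21)."""
    return 1.0 / ((1.0 - delta) * math.log(1.0 - delta) + delta)


def samples_required(nikolskii_sq, s, delta=0.5, eps=0.01):
    """Smallest integer m satisfying the sufficient condition (21)."""
    return math.ceil(c_delta(delta) * nikolskii_sq * math.log(2.0 * s / eps))
```

For instance, passing `nikolskii_sq = s` corresponds to the best possible case in which the squared Nikolskii constant equals $s$, so the bound scales like $s \log(s)$ up to the constant $c_\delta$.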


This theorem (see Sect. 3.7 for the proof) states that the sample complexity is
dominated by the behaviour of the weighted Nikolskii constant $\mathcal{N}(\mathcal{P}_S, w)$. Observe
that

\[ (\mathcal{N}(\mathcal{P}_S, w))^2 \geq s, \tag{22} \]

for any choice of $w$. Indeed, by (19), $(\mathcal{N}(\mathcal{P}_S, w))^2 \geq w(y) \mathcal{K}(\mathcal{P}_S)(y)$ for almost every $y$,
and therefore

\[ (\mathcal{N}(\mathcal{P}_S, w))^2 \int_D \frac{1}{w(y)} \, \mathrm{d}\varrho(y) \geq \int_D \mathcal{K}(\mathcal{P}_S)(y) \, \mathrm{d}\varrho(y). \]

The left-hand side is equal to $(\mathcal{N}(\mathcal{P}_S, w))^2$ due to (7), and the right-hand side is
equal to $s$, due to the relation (20) and the fact that the $\upsilon_\iota$ are orthonormal.

3.3 Monte Carlo Sampling


We are now ready to discuss the first part of Question 3 in the context of the
examples introduced in Sect. 2.3. Recall that Monte Carlo sampling corresponds
to setting

\[ \mu_1 = \cdots = \mu_m = \varrho. \]

In this case, it follows from (6) that the weight function is $w \equiv 1$. Hence, $\hat{f}$ is a standard
unweighted least-squares approximation. As shown by Theorem 3, the sample
complexity

\[ m \geq c_\delta \cdot (\mathcal{N}(\mathcal{P}_S))^2 \cdot \log(2s/\epsilon) \tag{23} \]

is governed by the unweighted Nikolskii constant

\[ \mathcal{N}(\mathcal{P}_S) = \left\| \sqrt{\mathcal{K}(\mathcal{P}_S)(\cdot)} \right\|_{L^\infty_\varrho(D)}. \tag{24} \]

We are interested in the behaviour of $(\mathcal{N}(\mathcal{P}_S))^2$ in relation to $s = |S|$. Clearly,
there are instances where $(\mathcal{N}(\mathcal{P}_S))^2$ attains the optimal value $(\mathcal{N}(\mathcal{P}_S))^2 = s$
(recall (22)). Indeed, the functions $\phi_\iota$ of Example 1 are orthonormal and equal to one
in absolute value. Hence, $\mathcal{K}(\mathcal{P}_S) \equiv s$ by (20), and (23) yields the sample estimate
$m \gtrsim s \cdot \log(2s/\epsilon)$, which is optimal up to constants and log factors.

Unfortunately, this desirable property does not hold in general. As the next result
attests, the constant $\mathcal{N}(\mathcal{P}_S)$ can generally be arbitrarily large in comparison to $s$.


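Lemma 1 There exists a probability space $(D, \mathcal{F}, \varrho)$ such that the following holds.
For every $s \in \mathbb{N}$ and $C > 0$, there exists a subspace $\mathcal{P} \subset L^2_\varrho(D)$ of dimension $s$
such that $\mathcal{N}(\mathcal{P}) \geq C$.


Proof We consider Example 2 in the case $d = 1$. The classical Legendre
polynomial $P_\iota$ attains its maximum absolute value on $[-1, 1]$ at $y = 1$, where $P_\iota(1) = 1$.
Hence,

\[ \| p_\iota \|_{L^\infty([-1,1])} = p_\iota(1) = \sqrt{2\iota + 1}. \tag{25} \]

It follows that for any subspace $\mathcal{P} = \mathcal{P}_S$, we have

\[ (\mathcal{N}(\mathcal{P}_S))^2 = \| \mathcal{K}(\mathcal{P}_S)(\cdot) \|_{L^\infty_\varrho(D)} = \mathcal{K}(\mathcal{P}_S)(1) = \sum_{\iota \in S} (2\iota + 1). \]

Since $S \subset \mathbb{N}_0$, $|S| = s$, can be arbitrary, we now choose it so that the right-hand side
exceeds $C^2$. □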

This lemma and its proof suggest that Monte Carlo sampling may be highly
suboptimal in the setting of Example 2 (and therefore Example 3 as well) if the 
indices in the target set S are allowed to become arbitrarily large. One way to 
mitigate this is to impose additional structure on S. A common structure is that 
of lower sets. 


Definition 1 A multi-index set $\Lambda \subseteq \mathbb{N}_0^d$ is lower if, whenever $\iota \in \Lambda$ and $\kappa \leq \iota$ (this
inequality is understood componentwise), then $\kappa \in \Lambda$.


Note that many common index sets used in polynomial approximation are lower. 
For example, the sets (10)—(12) are all lower. In general, lower sets are known to be 
good candidates for the support sets of polynomial coefficients of smooth functions 
in high dimensions [2, 8, 13, 23, 25, 25, 31]. In particular, this is true for solutions 
to parametric PDEs, where the lower set sparsity has been studied and variously 
exploited to construct effective polynomial approximations [13, 21-24, 28, 31]. 
Motivated by Example 1, we observe that it is also straightforward to define lower
subsets of $\mathbb{Z}^d$. In this case, we replace the inequality by $|\kappa| \leq |\iota|$, where, for a multi-index
$\iota = (\iota_k)_{k=1}^{d} \in \mathbb{Z}^d$, $|\iota| = (|\iota_k|)_{k=1}^{d}$ is the multi-index of its absolute values.

In the case of Example 2, it is known that when $S \subset \mathbb{N}_0^d$, $|S| \leq s$, is a lower set,
one has

\[ (\mathcal{N}(\mathcal{P}_S))^2 \leq s^2; \]

see [22, 23]. Furthermore, this bound is sharp, in the sense that there exists a lower
set $S$ of size $s$ (specifically, the set $S = \{(\nu, 0, \ldots, 0) : \nu = 0, \ldots, s - 1\}$) for
which $(\mathcal{N}(\mathcal{P}_S))^2 = s^2$. Hence, imposing a lower set structure reduces the sample
complexity for Monte Carlo sampling to at worst quadratic in $s$, up to log factors.
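
The quadratic bound can be checked numerically for small index sets. The sketch below assumes the tensor Legendre setting of Example 2, where each $|p_\iota|$ is maximized at $y = 1$, so that $(\mathcal{N}(\mathcal{P}_S))^2 = \mathcal{K}(\mathcal{P}_S)(1, \ldots, 1) = \sum_{\iota \in S} \prod_{k} (2\iota_k + 1)$. The function names `is_lower` and `nikolskii_sq_legendre` are hypothetical.

```python
from math import prod


def is_lower(S):
    """Check whether a finite set of multi-indices in N_0^d is lower.

    It suffices to test the immediate predecessors iota - e_k: if all of
    them belong to S for every iota in S, then so does every kappa <= iota.
    """
    S = set(S)
    for iota in S:
        for k, entry in enumerate(iota):
            if entry > 0:
                pred = iota[:k] + (entry - 1,) + iota[k + 1:]
                if pred not in S:
                    return False
    return True


def nikolskii_sq_legendre(S):
    """Squared unweighted Nikolskii constant for tensor Legendre polynomials,
    i.e. the sum over S of prod_k (2 * iota_k + 1)."""
    return sum(prod(2 * k + 1 for k in iota) for iota in S)
```

For the lower set `[(v, 0) for v in range(5)]` the value is $25 = s^2$, matching the sharpness claim, whereas the set `[(0, 0), (2, 0)]` is not lower and gives $6 > s^2 = 4$.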


Remark 3 In view of Example 3, we remark in passing that this quadratic bound
also holds for arbitrary lower sets and large classes of irregular domains [6], up to
a domain-dependent constant. Moreover, this also holds for any Lipschitz domain
in the case where $S = \mathcal{I}^{\mathrm{TD}}_t$ is the total degree index set (11) [38]. On the other
hand, for domains with $C^2$ boundary and $S = \mathcal{I}^{\mathrm{TD}}_t$, one has a better scaling in
higher dimensions; namely, $(\mathcal{N}(\mathcal{P}_S))^2 \leq c_D \, s^{1 + 1/d}$, where $c_D > 0$ is a constant
depending on the domain $D$ only [38].


3.4 Optimal Sampling 


We now answer the second part of Question 3 in the affirmative. Our aim is to
choose the weight function $w$ to minimize $\mathcal{N}(\mathcal{P}_S, w)$ and then choose the measures
satisfying Assumption 1. To do this, we appeal to (19) and, keeping in mind the
normalization (7), set

\[ w(y) = \left( \frac{1}{s} \mathcal{K}(\mathcal{P}_S)(y) \right)^{-1} = \left( \frac{1}{s} \sum_{\iota \in S} |\upsilon_\iota(y)|^2 \right)^{-1}. \]

Notice that this yields, via (19), the optimal Nikolskii constant

\[ \mathcal{N}(\mathcal{P}_S, w) = \sqrt{s}. \]

In particular, the sample complexity estimate (21) becomes

\[ m \geq c_\delta \cdot s \cdot \log(2s/\epsilon), \]

which is optimal up to the log factor.
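Having chosen $w$, we now choose the measures $\mu_i$ so that (6) holds. We consider
two possibilities. The first we term nonhierarchical, and is given simply by

\[ \mu_1 = \cdots = \mu_m = \mu, \qquad \mathrm{d}\mu(y) = \frac{1}{w(y)} \, \mathrm{d}\varrho(y) = \frac{1}{s} \sum_{\iota \in S} |\upsilon_\iota(y)|^2 \, \mathrm{d}\varrho(y). \tag{26} \]

Clearly, (6) holds in this case. The second scheme is hierarchical. In this scheme,
we suppose that $m = ks$ for some $k \in \mathbb{N}$. Then, we define

\[ \mathrm{d}\mu_i(y) = |\upsilon_{\iota_j}(y)|^2 \, \mathrm{d}\varrho(y), \quad (j - 1)k < i \leq jk, \quad j = 1, \ldots, s, \tag{27} \]

where $\{\iota_1, \ldots, \iota_s\}$ is an enumeration of the indices in $S$. Notice that

\[ \frac{1}{m} \sum_{i=1}^{m} \mathrm{d}\mu_i(y) = \frac{k}{m} \sum_{j=1}^{s} |\upsilon_{\iota_j}(y)|^2 \, \mathrm{d}\varrho(y) = \frac{1}{s} \sum_{\iota \in S} |\upsilon_\iota(y)|^2 \, \mathrm{d}\varrho(y). \]

Therefore, (6) also holds in this case.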

The nonhierarchical scheme (26) was introduced in [30] and is suitable for
learning an approximation in a fixed subspace $\mathcal{P}_S$. However, as discussed in [16, 67],
it is not well suited to the problem where one seeks to learn a sequence of
approximations $\hat{f}_1, \hat{f}_2, \ldots$ in a hierarchy of nested subspaces corresponding to index sets
$S_1 \subset S_2 \subset \cdots$. The issue is that as the index set $S = S_j$ changes, the measure $\mu$ defined in (26) changes,
and hence the existing samples are effectively drawn from the wrong distribution
for the purposes of constructing an approximation in the new subspace corresponding to $S_{j+1}$. The
hierarchical scheme (27), introduced in [67], overcomes this problem; see also [16]
for a different approach. We refer to [4, 67] for further information.

3.5 Practical Optimal Sampling via Discrete Measures


Unfortunately, generating samples from either the measure (26) or the measures (27)
may not be straightforward, since it requires an orthonormal basis $\{\upsilon_\iota\}_{\iota \in S}$ of $\mathcal{P}_S$.
This may not be available in practice, and even if it is, drawing samples from the
resulting measures may be computationally challenging. See [4, 16, 30, 68] for
further information on this issue, as well as [49, 71] for the specific case of tensor 
product polynomial approximation. 

A remedy to this situation was proposed in [4, 68]. The idea is to replace $\varrho$ (which
is typically a continuous measure) by a discrete measure, supported on a finite grid,
so that both constructing an orthonormal basis and sampling from the corresponding
measures are automatically straightforward. Let $Z = \{z_i\}_{i=1}^{k} \subset D$ be a finite grid.
We consider the discrete uniform measure given by

\[ \tau = \frac{1}{k} \sum_{i=1}^{k} \delta_{z_i}. \tag{28} \]


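The idea is now to replace $\varrho$ by $\tau$ throughout. Consider the nonhierarchical scheme
for simplicity. Then, doing so, we deduce that if

\[ m \gtrsim s \cdot \log(2s/\epsilon), \tag{29} \]

then the error bound

\[ \| f - \hat{f} \|_{L^2_\tau(D;\mathcal{V})} \lesssim \inf_{p \in \mathcal{P}_{S;\mathcal{V}}} \left\{ \| f - p \|_{L^2_\tau(D;\mathcal{V})} + \| f - p \|_{\mathrm{disc}} \right\} + \| f - \mathcal{P}_h f \|_{L^2_\tau(D;\mathcal{V})} + \| e \|_{\ell^2([m];\mathcal{V})} \tag{30} \]

holds with probability at least $1 - \epsilon$, where

\[ w(y) = \left( \frac{1}{s} \sum_{\iota \in S} |\upsilon_\iota(y)|^2 \right)^{-1}, \quad y \in \mathrm{supp}(\tau) = Z, \]

\[ \mathrm{d}\mu(y) = \frac{1}{s} \sum_{\iota \in S} |\upsilon_\iota(y)|^2 \, \mathrm{d}\tau(y), \quad y \in \mathrm{supp}(\tau), \tag{31} \]

and $\{\upsilon_\iota\}_{\iota \in S} \subset L^2_\tau(D)$ is an orthonormal basis for $\mathcal{P}_S$ with respect to $\tau$.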
Since $\tau$ is a discrete measure, this orthonormal basis can be constructed via
straightforward linear algebra. Indeed, define the matrix

\[ B = \frac{1}{\sqrt{k}} \left( \phi_{\iota_j}(z_i) \right)_{i \in [k], j \in [s]} \in \mathbb{C}^{k \times s} \]

and suppose that it has the QR factorization $B = QR$, where $Q \in \mathbb{C}^{k \times s}$ has
orthonormal columns and $R \in \mathbb{C}^{s \times s}$ is upper triangular. Then the orthonormal basis
$\{\upsilon_\iota\}_{\iota \in S}$ is given by

\[ \upsilon_{\iota_i}(y) = \sum_{j=1}^{i} (R^{-\top})_{ij} \, \phi_{\iota_j}(y), \quad i \in [s]. \]

In particular, its values on the grid $Z$ are precisely

\[ \upsilon_{\iota_j}(z_i) = \sqrt{k} \, Q_{ij}, \quad i \in [k], \ j \in [s]. \]

Substituting this into (31) and recalling the definition of $\tau$, we see that the discrete
measure $\mu$ is given by

\[ \mathrm{d}\mu(y) = \sum_{i \in [k]} \left( \frac{1}{s} \sum_{j \in [s]} |Q_{ij}|^2 \right) \delta_{z_i}(y). \]

Hence, sampling from $\mu$ is now trivial. Indeed, $y \sim \mu$ if

\[ \mathbb{P}(y = z_i) = \frac{1}{s} \sum_{j \in [s]} |Q_{ij}|^2, \quad i \in [k]. \]
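
The construction above amounts to a QR factorization followed by a draw from a finite probability vector. The following is a minimal sketch of the nonhierarchical discrete scheme using NumPy, assuming the dictionary has already been evaluated on the grid and that the resulting matrix has full column rank with at least as many grid points as basis functions; the function name `discrete_optimal_sampler` and its arguments are hypothetical.

```python
import numpy as np


def discrete_optimal_sampler(Phi_grid, m, rng):
    """Draw m grid indices from the discrete optimal measure mu.

    Phi_grid : (k, s) array with entries phi_{iota_j}(z_i), assumed full column rank
    m        : number of sample points to draw
    rng      : a numpy.random.Generator
    Returns (idx, weights, probs): the sampled grid indices, the weights
    w(z_i) at the sampled points, and the full probability vector.
    """
    k, s = Phi_grid.shape
    B = Phi_grid / np.sqrt(k)
    Q, _ = np.linalg.qr(B)  # reduced QR: Q is (k, s) with orthonormal columns
    probs = np.sum(np.abs(Q) ** 2, axis=1) / s  # P(y = z_i)
    probs = probs / probs.sum()  # guard against rounding before sampling
    idx = rng.choice(k, size=m, p=probs)
    # w(z_i) = (k/s * sum_j |Q_ij|^2)^(-1) = 1 / (k * probs_i)
    weights = 1.0 / (k * probs[idx])
    return idx, weights, probs
```

The returned `idx` and `weights` can be used to assemble the weighted least-squares matrix from the rows `Phi_grid[idx]`. Only the matrix `Q` is needed for sampling; `R` would be required only to evaluate the orthonormal basis off the grid.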


The reader will have no doubt noticed that the error bound (30) is with respect to
the discrete measure $\tau$. It is often preferable to also have an error bound over the
original measure $\varrho$. Such an error bound is guaranteed whenever the $L^2_\tau$-norm is
equivalent to the $L^2_\varrho$-norm over $\mathcal{P}_S$, i.e.

\[ \alpha' \| p \|^2_{L^2_\varrho(D)} \leq \| p \|^2_{L^2_\tau(D)} \leq \beta' \| p \|^2_{L^2_\varrho(D)}, \quad \forall p \in \mathcal{P}_S. \tag{32} \]

Indeed, recall that the sampling condition (29) and the choices of $\mu$ and $w$ imply
a norm equivalence between the $L^2_\tau$-norm and the discrete norm over the sample
points, i.e.

\[ (1 - \delta) \| p \|^2_{L^2_\tau(D)} \leq \frac{1}{m} \sum_{i=1}^{m} w(y_i) |p(y_i)|^2 \leq (1 + \delta) \| p \|^2_{L^2_\tau(D)}, \quad \forall p \in \mathcal{P}_S. \]

Hence, we deduce that

\[ (1 - \delta) \alpha' \| p \|^2_{L^2_\varrho(D)} \leq \frac{1}{m} \sum_{i=1}^{m} w(y_i) |p(y_i)|^2 \leq (1 + \delta) \beta' \| p \|^2_{L^2_\varrho(D)}, \quad \forall p \in \mathcal{P}_S. \]

Therefore, the norm equivalence (17) with respect to the original $L^2_\varrho$-norm also
holds, meaning that an error bound in this norm follows immediately from
Theorem 2 (with constant $\alpha = (1 - \delta) \alpha'$).


Remark 4 A simple means to ensure (32) is to construct $Z = \{z_i\}_{i=1}^{k}$ as a random
Monte Carlo grid (independently of the sample points $y_i$). That is, we let the $z_i$ be
independent random variables drawn according to the measure $\varrho$. Observe that (32)
is precisely

\[ \alpha' \| p \|^2_{L^2_\varrho(D)} \leq \frac{1}{k} \sum_{i=1}^{k} |p(z_i)|^2 \leq \beta' \| p \|^2_{L^2_\varrho(D)}, \quad \forall p \in \mathcal{P}_S. \]

This is nothing more than the special case of (17) for the grid $Z$ (recall that $w \equiv 1$
for Monte Carlo sampling). Hence, (32) is ensured by Theorem 3. In particular, with
probability at least $1 - \epsilon$, it holds with $1 - \delta \leq \alpha' \leq \beta' \leq 1 + \delta$, provided

\[ k \geq c_\delta \cdot (\mathcal{N}(\mathcal{P}_S))^2 \cdot \log(2s/\epsilon), \]

where, as in Sect. 3.3, $\mathcal{N}(\mathcal{P}_S) = \| \sqrt{\mathcal{K}(\mathcal{P}_S)(\cdot)} \|_{L^\infty_\varrho(D)}$ is the unweighted Nikolskii
constant. Of course, this constant may be very large depending on the choice of
$\mathcal{P}_S$; recall the discussion in Sect. 3.3. Yet, this grid is only used to define the
optimal sampling measure $\mu$. Therefore, the number of grid points $k$ only affects the
computational cost for generating the sample points. It does not affect the sample
complexity of the weighted least-squares approximation, which is $m \gtrsim s \log(2s/\epsilon)$
for the optimal measure.

Having said this, a practical problem is that estimates for $\mathcal{N}(\mathcal{P}_S)$ may not be
available, or if they are, they may not be particularly tight, thus leading to overly 
large grids. In [38], an empirical strategy is described to mitigate this issue, based 
on independently drawing an auxiliary grid that is used to test the quality of the 
grid Z. 


3.6 Numerical Examples 


We conclude this discussion on least-squares approximation with several numerical 
examples. In these and other examples considered later in this chapter, we consider 
the scalar-valued functions


if 
fi(y) = exp (-72%] : 


k=1 


Thena/or4t cos(16y,/2*) 


ho) = 
Tat — ye/44) (33) 
d 
7 d/4 
BO)= I] d/4+ (i + (—1)it1/(i + 1))?’ 
fay) 
n= ——— 
ae V1yi| 


Since our goal is to compare Monte Carlo sampling with the optimal sampling 
procedures described above, we focus on Examples 2 and 3 (recall that Monte Carlo 
sampling is optimal, up to the log term, in the case of Example 1). To this end, we 
consider the domains 


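\[ \begin{aligned}
D_1 &= [-1, 1]^d, \\
D_2 &= \{ y \in \mathbb{R}^d : 1/4 \leq y_1^2 + \cdots + y_d^2 \leq 1 \}, \\
D_3 &= \{ y \in [-1, 1]^d : y_1 + \cdots + y_d \leq 1 \}.
\end{aligned} \tag{34} \]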


We follow the approach of Sect. 3.5 and, in particular, Remark 4, to generate a
Monte Carlo grid $Z$ and the corresponding discrete measure $\tau$ as in (28). Here,
$k = 30 s_{\max}$, where $s_{\max}$ is the maximum size of $s$ used in the given experiment. For
the error, we compute the relative L°°-norm error, i.e. 


II f Filluge wy (35) 
II fll nx) 

We perform a total of $T = 50$ trials. In the Monte Carlo and optimal nonhierarchical
schemes, each trial corresponds to a single draw of the sample points $y_1, \ldots, y_m$ at
each value of $m$ considered. For the optimal hierarchical scheme, a single trial is a
full set of points $y_1, \ldots, y_{m_{\max}}$, where $m_{\max}$ is the maximum value of $m$ considered.
In all cases, we report the log-average of the error (35) over these trials, with the 
shaded regions corresponding to one log-standard deviation (see [13, App. A] for 
further information). 

Fig. 1 The relative error (35) versus $m$ for (weighted) least squares in the case of Examples 2 and
3. We compare Monte Carlo sampling (left), discrete optimal nonhierarchical sampling (middle)
and discrete optimal hierarchical sampling (right) for $f = f_1$, D = Do (top row), f = fo,
$D = D_3$ (middle row) and $f = f_4$, $D = D_2$ (bottom row). In all experiments, $S = \mathcal{I}^{\mathrm{HC}}_t$ is the
hyperbolic cross index set (12) and, for each value of $t$, $m$ is chosen as the smallest integer such
that $m \geq s \log(s)$, where $s = |\mathcal{I}^{\mathrm{HC}}_t|$.

In Fig. 1, we compare Monte Carlo with both the hierarchical and nonhierarchical
optimal sampling schemes. In two dimensions, typical sample points generated by
these schemes for different domains are shown in Fig. 2. As we see from Fig. 1,
Monte Carlo sampling leads to worse performance compared to both optimal 
sampling schemes, especially in lower dimensional problems. It is notable that 
Monte Carlo sampling also leads to an increasing approximation error in several 
cases, since the number of samples is chosen to scale log-linearly with s, rather 
than log-quadratically (recall the discussion in Sect. 3.3). This is corroborated in 
Fig. 3, where we plot the constant $\alpha$ for the different sampling schemes. On the
other hand, we observe that the relative performance of Monte Carlo sampling 
improves in higher dimensions, where it offers similar approximation errors to the 
optimal schemes. We see this effect consistently throughout this work. Finally, we 
remark in passing that there is virtually no difference between the nonhierarchical 
and hierarchical versions of the optimal sampling scheme.

Fig. 2 The domains $D_1$, $D_2$ and $D_3$ (left to right) in $d = 2$ dimensions. The grey dots are the finite
Monte Carlo grid $Z = \{z_i\}_{i=1}^{k}$ with $k = 20{,}000$. The top row shows Monte Carlo sampling with
$m = 1824$ points. The second row shows the same number of points generated from the discrete
optimal nonhierarchical sampling measure based on $S = \mathcal{I}^{\mathrm{HC}}_t$, where $t = 68$ and $s = 308$

Fig. 3 The constant $1/\sqrt{\alpha}$ versus $s$ in $d = 2$ dimensions for Monte Carlo sampling (left), discrete
optimal nonhierarchical sampling (middle) and discrete optimal hierarchical sampling (right) and
for $D_1$ (top row) and $D_2$ (bottom row), where $S = \mathcal{I}^{\mathrm{HC}}_t$


3.7 Proofs of Theorems 2 and 3


The proofs of Theorems 2 and 3 follow ideas that are now well established in the 
literature (see, for example, [4, 6, 30, 33, 68]). They are included for completeness. 
We commence with Theorem 2. We now observe the following: 


Lemma 2 Suppose that (17) holds. Then,

\[
  \alpha \|p\|^2_{L^2_\varrho(D;V)} \le \|p\|^2_{\mathrm{disc}} \le \beta \|p\|^2_{L^2_\varrho(D;V)}, \qquad \forall p \in P_{S;V_h}, \tag{36}
\]

where $\|\cdot\|_{\mathrm{disc}}$ is as in Sect. 3.2.


Simply put, this lemma states that if there is a norm equivalence in the scalar-
valued case over $P_S = P_{S;\mathbb{C}}$ between the continuous $L^2_\varrho$-norm and the discrete
norm defined by the sample points, then there is also the same norm equivalence in
the Hilbert-valued case over $P_{S;V_h}$.


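Proof First, let $\{v_j\}_{j \in [K]}$ be an orthonormal basis of $V_h \subseteq V$. Let $p \in P_{S;V_h}$, and
observe that it has the unique expression

\[
  p = \sum_{i \in [s]} \sum_{j \in [K]} c_{ij}\, \phi_{\iota_i} v_j, \qquad c_{ij} \in \mathbb{C},\ i \in [s],\ j \in [K],
\]

where $\{\iota_1, \ldots, \iota_s\}$ is an enumeration of the indices in $S$. Let $q_j = \sum_{i \in [s]} c_{ij} \phi_{\iota_i}$, and
observe that $q_j \in P_S$. Notice also that

\[
  \|p(y)\|_V^2 = \sum_{j \in [K]} \Big| \sum_{i \in [s]} c_{ij} \phi_{\iota_i}(y) \Big|^2 = \sum_{j \in [K]} |q_j(y)|^2, \qquad \forall y \in D.
\]

In particular, this implies that

\[
  \|p\|^2_{L^2_\varrho(D;V)} = \int_D \|p(y)\|_V^2 \,\mathrm{d}\varrho(y) = \sum_{j \in [K]} \int_D |q_j(y)|^2 \,\mathrm{d}\varrho(y) = \sum_{j \in [K]} \|q_j\|^2_{L^2_\varrho(D)} \tag{37}
\]

and also that

\[
  \|p\|^2_{\mathrm{disc}} = \frac{1}{m} \sum_{i \in [m]} w(y_i) \|p(y_i)\|_V^2 = \sum_{j \in [K]} \frac{1}{m} \sum_{i \in [m]} w(y_i) |q_j(y_i)|^2 .
\]

Hence, by the scalar-valued norm equivalence (17), we deduce that

\[
  \alpha \sum_{j \in [K]} \|q_j\|^2_{L^2_\varrho(D)} \le \|p\|^2_{\mathrm{disc}} \le \beta \sum_{j \in [K]} \|q_j\|^2_{L^2_\varrho(D)} .
\]

The result now follows immediately from (37). □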


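Proof of Theorem 2 Since $\hat{f}$ is a solution of the least-squares problem, it is also a
solution of the variational equations

\[
  \text{find } \hat{f} \in P_{S;V_h} \text{ such that } \langle \hat{f}, q \rangle_{\mathrm{disc}} = \langle f, q \rangle_{\mathrm{disc}} + \frac{1}{m} \sum_{i \in [m]} w(y_i) \langle n_i, q(y_i) \rangle_V, \qquad \forall q \in P_{S;V_h}.
\]

Uniqueness of $\hat{f}$ now follows immediately from Lemma 2, since $\langle \cdot, \cdot \rangle_{\mathrm{disc}}$ forms an
inner product on $P_{S;V}$ and therefore $P_{S;V_h}$.


We now derive the desired error bound. First, observe that since $q \in P_{S;V_h}$, these
equations are equivalent to

\[
  \text{find } \hat{f} \in P_{S;V_h} \text{ such that } \langle \hat{f}, q \rangle_{\mathrm{disc}} = \langle \mathcal{P}_h(f), q \rangle_{\mathrm{disc}} + \frac{1}{m} \sum_{i \in [m]} w(y_i) \langle n_i, q(y_i) \rangle_V, \qquad \forall q \in P_{S;V_h}.
\]

Now let $q \in P_{S;V_h}$ be arbitrary. Then these equations give

\[
  \|\hat{f} - q\|^2_{\mathrm{disc}} = \langle \mathcal{P}_h(f) - q, \hat{f} - q \rangle_{\mathrm{disc}} + \frac{1}{m} \sum_{i \in [m]} w(y_i) \langle n_i, \hat{f}(y_i) - q(y_i) \rangle_V,
\]

and applying the Cauchy-Schwarz inequality several times to the right-hand side,
we deduce that

\[
  \|\hat{f} - q\|_{\mathrm{disc}} \le \|\mathcal{P}_h(f) - q\|_{\mathrm{disc}} + \|e\|_{\ell^2([m];V)} .
\]

Here, we also recall the definition of $e$. Furthermore, using (36) and the fact that
$q \in P_{S;V_h}$, we get

\[
  \|\hat{f} - q\|_{L^2_\varrho(D;V)} \le \frac{1}{\sqrt{\alpha}} \left( \|\mathcal{P}_h(f) - q\|_{\mathrm{disc}} + \|e\|_{\ell^2([m];V)} \right).
\]

Now, let $p \in P_{S;V}$ be arbitrary, and write $q = \mathcal{P}_h(p)$. Therefore,

\[
\begin{aligned}
  \|f - \hat{f}\|_{L^2_\varrho(D;V)}
  &\le \|f - \mathcal{P}_h(f)\|_{L^2_\varrho(D;V)} + \|\mathcal{P}_h(f) - \mathcal{P}_h(p)\|_{L^2_\varrho(D;V)} + \|\mathcal{P}_h(p) - \hat{f}\|_{L^2_\varrho(D;V)} \\
  &\le \frac{1}{\sqrt{\alpha}} \left( \|\mathcal{P}_h(f) - \mathcal{P}_h(p)\|_{\mathrm{disc}} + \|e\|_{\ell^2([m];V)} \right)
     + \|\mathcal{P}_h(f) - \mathcal{P}_h(p)\|_{L^2_\varrho(D;V)} + \|\mathcal{P}_h(f) - f\|_{L^2_\varrho(D;V)} \\
  &\le \frac{1}{\sqrt{\alpha}} \left( \|f - p\|_{\mathrm{disc}} + \|e\|_{\ell^2([m];V)} \right)
     + \|f - p\|_{L^2_\varrho(D;V)} + \|\mathcal{P}_h(f) - f\|_{L^2_\varrho(D;V)} .
\end{aligned}
\]

Here, in the final step, we used the fact that $\|\mathcal{P}_h(g)\|_{\mathrm{disc}} \le \|g\|_{\mathrm{disc}}$ for all $g \in
L^2_\varrho(D; V)$, since $\mathcal{P}_h$ is an orthogonal projection, and likewise $\|\mathcal{P}_h(g)\|_{L^2_\varrho(D;V)} \le
\|g\|_{L^2_\varrho(D;V)}$. This completes the proof. □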


We now consider Theorem 3. This is most commonly established using the 
matrix Chernoff bound [93, Thm. 1.1], which we restate here for convenience: 


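Theorem 4 (Matrix Chernoff Bound) Let $X_1, \ldots, X_m$ be independent, self-
adjoint random matrices of size $s \times s$. Assume that $X_i$ is positive semidefinite and
$\lambda_{\max}(X_i) \le R$ almost surely for each $i = 1, \ldots, m$, and define

\[
  \mu_{\min} = \lambda_{\min}\Big( \sum_{i=1}^{m} \mathbb{E}(X_i) \Big), \qquad \mu_{\max} = \lambda_{\max}\Big( \sum_{i=1}^{m} \mathbb{E}(X_i) \Big).
\]

Then, for $0 \le \delta < 1$,

\[
  \mathbb{P}\Big( \lambda_{\min}\Big( \sum_{i=1}^{m} X_i \Big) \le (1 - \delta) \mu_{\min} \Big) \le s \cdot \exp\Big( -\frac{\mu_{\min} \big( (1 - \delta) \log(1 - \delta) + \delta \big)}{R} \Big),
\]

and, for $\delta \ge 0$,

\[
  \mathbb{P}\Big( \lambda_{\max}\Big( \sum_{i=1}^{m} X_i \Big) \ge (1 + \delta) \mu_{\max} \Big) \le s \cdot \exp\Big( -\frac{\mu_{\max} \big( (1 + \delta) \log(1 + \delta) - \delta \big)}{R} \Big).
\]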
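Proof of Theorem 3 Let $\{\upsilon_i\}_{i \in [s]}$ be an orthonormal basis of $P_S \subset L^2_\varrho(D)$ with
respect to $\varrho$. Let $p \in P_S$, $p \ne 0$, be arbitrary, and write $p = \sum_{i \in [s]} c_i \upsilon_i$ and
$c = (c_i)_{i \in [s]}$. Then,

\[
  \|p\|^2_{L^2_\varrho(D)} = \int_D \Big| \sum_{i \in [s]} c_i \upsilon_i(y) \Big|^2 \,\mathrm{d}\varrho(y) = \sum_{i \in [s]} |c_i|^2 = \|c\|_2^2,
\]

and

\[
  \frac{1}{m} \sum_{i=1}^{m} w(y_i) |p(y_i)|^2 = c^* G c,
\]

where $G = A^* A \in \mathbb{C}^{s \times s}$ is the self-adjoint matrix with entries
$G_{jk} = \frac{1}{m} \sum_{i=1}^{m} w(y_i) \overline{\upsilon_j(y_i)} \upsilon_k(y_i)$. It therefore suffices to show that
$\lambda_{\min}(G) > 1 - \delta$ and $\lambda_{\max}(G) < 1 + \delta$. Write

\[
  G = \sum_{i=1}^{m} X_i, \qquad X_i = \Big( \frac{1}{m} w(y_i) \overline{\upsilon_j(y_i)} \upsilon_k(y_i) \Big)_{j,k=1}^{s} .
\]

By construction, these matrices are independent and positive semidefinite. Also,

\[
  (\mathbb{E}(X_i))_{jk} = \int_D \overline{\upsilon_j(y)} \upsilon_k(y) w(y) \frac{1}{m} \,\mathrm{d}\mu_i(y),
\]

which gives

\[
  \Big( \sum_{i=1}^{m} \mathbb{E}(X_i) \Big)_{jk} = \int_D \overline{\upsilon_j(y)} \upsilon_k(y) w(y) \frac{1}{m} \sum_{i=1}^{m} \mathrm{d}\mu_i(y) = \int_D \overline{\upsilon_j(y)} \upsilon_k(y) \,\mathrm{d}\varrho(y) = \delta_{jk} .
\]

Hence, $\sum_{i=1}^{m} \mathbb{E}(X_i) = I$ is the identity matrix. Moreover, for any $c \in \mathbb{C}^s$, we have

\[
  c^* X_i c = \frac{1}{m} \Big| \sum_{j \in [s]} c_j \sqrt{w(y_i)}\, \upsilon_j(y_i) \Big|^2
  \le \frac{(\mathcal{N}(P_S, w))^2}{m} \Big\| \sum_{j \in [s]} c_j \upsilon_j \Big\|^2_{L^2_\varrho(D)}
  = \frac{(\mathcal{N}(P_S, w))^2}{m} \|c\|_2^2 .
\]

Since these matrices are self-adjoint and positive semidefinite, we deduce that

\[
  \lambda_{\max}(X_i) \le \frac{(\mathcal{N}(P_S, w))^2}{m} .
\]

We now apply the matrix Chernoff bound (Theorem 4) with $s$, $R = (\mathcal{N}(P_S, w))^2 / m$ and

\[
  \mu_{\min} = \lambda_{\min}\Big( \sum_{i=1}^{m} \mathbb{E}(X_i) \Big) = \lambda_{\min}(I) = 1,
\]

and likewise $\mu_{\max} = 1$. This gives

\[
\begin{aligned}
  \mathbb{P}\big( \lambda_{\min}(G) \le 1 - \delta \text{ or } \lambda_{\max}(G) \ge 1 + \delta \big)
  &\le \mathbb{P}\big( \lambda_{\min}(G) \le 1 - \delta \big) + \mathbb{P}\big( \lambda_{\max}(G) \ge 1 + \delta \big) \\
  &\le s \cdot \Big( \exp\Big( -\frac{m \big( (1 - \delta) \log(1 - \delta) + \delta \big)}{(\mathcal{N}(P_S, w))^2} \Big)
     + \exp\Big( -\frac{m \big( (1 + \delta) \log(1 + \delta) - \delta \big)}{(\mathcal{N}(P_S, w))^2} \Big) \Big).
\end{aligned}
\]

Note that $(1 + \delta) \log(1 + \delta) - \delta \le (1 - \delta) \log(1 - \delta) + \delta$ for $0 < \delta < 1$. Hence,

\[
  \mathbb{P}\big( \lambda_{\min}(G) \le 1 - \delta \text{ or } \lambda_{\max}(G) \ge 1 + \delta \big)
  \le 2 s \cdot \exp\Big( -\frac{m \big( (1 + \delta) \log(1 + \delta) - \delta \big)}{(\mathcal{N}(P_S, w))^2} \Big) \le \epsilon,
\]

where in the last step we use the condition on $m$. This completes the proof. □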
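
As a numerical companion to the last step of this proof, the following sketch evaluates the final bound $2s \exp(-m c / \mathcal{N}^2)$ with $c = (1+\delta)\log(1+\delta) - \delta$, and the smallest $m$ that drives it below $\epsilon$. It is a hedged illustration only: the function names are hypothetical, and `nikolskii_sq` stands for an assumed known value of $(\mathcal{N}(P_S, w))^2$.

```python
import math

def chernoff_exponent(delta):
    """c = (1 + delta) log(1 + delta) - delta, the smaller of the two exponents."""
    return (1 + delta) * math.log(1 + delta) - delta

def failure_bound(s, m, delta, nikolskii_sq):
    """Upper bound 2 s exp(-m c / N^2) on the probability that
    lambda_min(G) <= 1 - delta or lambda_max(G) >= 1 + delta."""
    return 2 * s * math.exp(-m * chernoff_exponent(delta) / nikolskii_sq)

def samples_needed(s, delta, eps, nikolskii_sq):
    """Smallest integer m for which failure_bound(s, m, ...) <= eps."""
    return math.ceil(nikolskii_sq * math.log(2 * s / eps) / chernoff_exponent(delta))

# Hypothetical setting: s = 100 and N^2 = s (the best possible value)
m = samples_needed(100, 0.5, 0.01, 100.0)
```

With $\mathcal{N}^2 = s$ the required $m$ is proportional to $s \log(2s/\epsilon)$, which is the log-linear scaling referred to in this section.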


4 Sparse Approximation via $\ell^1$-Minimization


Having discussed the case of Problem 1, we now consider the substantially more
challenging setting of Problem 2. In order to facilitate its solution, we now also make
an additional assumption on the dictionary $\Phi$: namely, the index set $\mathcal{I}$ is finite,
and the functions $\phi_\iota$, $\iota \in \mathcal{I}$, are linearly independent. In what follows, we write
$n = |\mathcal{I}|$. Note that, typically, $n \gg m$, where $m$ is the number of measurements.
It is notable that the examples described in Sect. 2.3 correspond to cases where the
index set is countable. In this case, one may define $\mathcal{I}$ as a large but finite truncated
index set in which the target set $S$ in the sparse representation (5) is expected to lie.
We shall return to this matter briefly in Sect. 6 (see Remark 12).

This aside, we now also assume that $\mu_1 = \cdots = \mu_m = \mu$ in Assumption 1, i.e.
$\mu$ is a probability measure that is absolutely continuous with respect to $\varrho$ and for
which the Radon-Nikodym derivative is strictly positive almost everywhere. In this
case, the corresponding weight function $w$ satisfies

\[
  \mathrm{d}\mu(y) = \frac{1}{w(y)} \,\mathrm{d}\varrho(y). \tag{38}
\]

This is done to simplify several of the arguments later. However, it is also possible
to consider distinct measures $\mu_1, \ldots, \mu_m$ as in the previous section.


4.1 Formulation 


Given $\Phi = \{\phi_\iota : \iota \in \mathcal{I}\}$, we strive to exploit the fact that $f$ is assumed to have an
approximate sparse representation in $\Phi$. In this section, we do this via minimizing
the $\ell^1(\mathcal{I}; V)$-norm of the coefficients, while also promoting fidelity of the resulting
approximation to the measurements (2). There are various ways to do this, including
the ($V$-valued) Quadratically Constrained Basis Pursuit (QCBP)

\[
  \hat{f} \in \operatorname*{argmin}_{p \in P_{\mathcal{I};V_h}} \Big\{ \|c\|_{\ell^1(\mathcal{I};V)} : \frac{1}{m} \sum_{i=1}^{m} w(y_i) \|f(y_i) + n_i - p(y_i)\|_V^2 \le \eta^2 \Big\}, \tag{39}
\]

the LASSO

\[
  \hat{f} \in \operatorname*{argmin}_{p \in P_{\mathcal{I};V_h}} \Big\{ \lambda \|c\|_{\ell^1(\mathcal{I};V)} + \frac{1}{m} \sum_{i=1}^{m} w(y_i) \|f(y_i) + n_i - p(y_i)\|_V^2 \Big\} \tag{40}
\]

or the Square-Root LASSO (SR-LASSO)

\[
  \hat{f} \in \operatorname*{argmin}_{p \in P_{\mathcal{I};V_h}} \Big\{ \lambda \|c\|_{\ell^1(\mathcal{I};V)} + \sqrt{ \frac{1}{m} \sum_{i=1}^{m} w(y_i) \|f(y_i) + n_i - p(y_i)\|_V^2 } \Big\}. \tag{41}
\]

Here, in all cases, $c = (c_\iota)_{\iota \in \mathcal{I}} \in V_h^n$ are the coefficients of $p = \sum_{\iota \in \mathcal{I}} c_\iota \phi_\iota \in
P_{\mathcal{I};V_h}$. The focus of this work is not the choice between (39), (40) and (41). We remark
that (41) enjoys a known advantage over the other problems, in that the theoretically
optimal value of the tuning parameter $\lambda$ is independent of the noise (see also
Theorem 5), which in this case also includes the typically unknown error $f - f_{\mathcal{I}}$.
For further background and in-depth comparison of these optimization problems,
see [10].

Now let

\[
  A = \frac{1}{\sqrt{m}} \Big( \sqrt{w(y_i)}\, \phi_{\iota_j}(y_i) \Big)_{i \in [m],\, j \in [n]} \in \mathbb{C}^{m \times n}, \tag{42}
\]

\[
  b = \frac{1}{\sqrt{m}} \Big( \sqrt{w(y_i)}\, \big( f(y_i) + n_i \big) \Big)_{i \in [m]} \in V^m, \tag{43}
\]

where $\{\iota_1, \ldots, \iota_n\}$ is an enumeration of the indices in $\mathcal{I}$. Notice that we use the
same notation for this matrix as in the previous section (see (15)). However, it is
important to note that this matrix is generally short (fat), since $n > m$, whereas
the matrix (15) is $m \times s$, and therefore tall. Then, $\hat{f} = \sum_{\iota \in \mathcal{I}} \hat{c}_\iota \phi_\iota$ is a solution
of (39), (40) or (41) if and only if $\hat{c} = (\hat{c}_\iota)_{\iota \in \mathcal{I}} \in V_h^n$ is given by

\[
  \hat{c} \in \operatorname*{argmin}_{z \in V_h^n} \big\{ \|z\|_{\ell^1([n];V)} : \|A z - b\|_{\ell^2([m];V)} \le \eta \big\},
\]

\[
  \hat{c} \in \operatorname*{argmin}_{z \in V_h^n} \big\{ \lambda \|z\|_{\ell^1([n];V)} + \|A z - b\|^2_{\ell^2([m];V)} \big\}
\]

or

\[
  \hat{c} \in \operatorname*{argmin}_{z \in V_h^n} \big\{ \lambda \|z\|_{\ell^1([n];V)} + \|A z - b\|_{\ell^2([m];V)} \big\},
\]

respectively.

Remark 5 (Algorithms for Solving (39)-(41)) The $V$-valued versions of the QCBP,
LASSO and SR-LASSO problems can be solved by considering reformulations
of standard methods for solving their real- and complex-valued counterparts. For
example, in [36], the LASSO problem was solved by extending Bregman iterations
and forward-backward iterations to the $V$-valued case, while in [12] the $V$-valued
SR-LASSO problem is solved via primal-dual iterations. We shall not describe
algorithms for solving (39)-(41) in any further detail and refer the interested reader
to [12, 36].


4.2 Accuracy, Stability and Sample Complexity 


As in Sect. 3, our main assumption will be a condition of the form (17), but with
two differences. First, since the target set from which the sparse representation of
$f$ is obtained is unknown, we require this to hold for all subsets, not just a fixed
subset. Second, as we see in the theorem below, we also require it to hold for
some value $t > s$. The precise condition is as follows:

\[
  \alpha \|p\|^2_{L^2_\varrho(D)} \le \frac{1}{m} \sum_{i=1}^{m} w(y_i) |p(y_i)|^2 \le \beta \|p\|^2_{L^2_\varrho(D)}, \qquad \forall p \in P_T,\ T \subseteq \mathcal{I},\ |T| \le t. \tag{44}
\]

In addition to this, we also recall that $\Phi$ is a finite dictionary consisting of linearly
independent elements. Therefore, it is a Riesz basis, meaning that

\[
  a \|c\|^2_{\ell^2(\mathcal{I})} \le \Big\| \sum_{\iota \in \mathcal{I}} c_\iota \phi_\iota \Big\|^2_{L^2_\varrho(D)} \le b \|c\|^2_{\ell^2(\mathcal{I})}, \qquad \forall c = (c_\iota)_{\iota \in \mathcal{I}} \in \mathbb{C}^n, \tag{45}
\]

for constants $0 < a \le b < \infty$. Finally, before stating the main result, we need some
additional notation. We write

\[
  \sigma_s(x)_{\ell^1([n];V)} = \inf \big\{ \|x - z\|_{\ell^1([n];V)} : z \in V^n \text{ is } s\text{-sparse} \big\}, \qquad x \in V^n. \tag{46}
\]

Here, we recall that a Hilbert-valued vector $z = (z_i)_{i \in [n]} \in V^n$ is $s$-sparse if it has
at most $s$ nonzero entries, i.e. $|\{i : z_i \ne 0\}| \le s$.
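
The quantity (46) is easy to evaluate once the $V$-norms of the entries are known, since the infimum is attained by keeping the $s$ entries of largest norm and discarding the rest. The sketch below assumes the input is the list of norms $\|x_i\|_V$ (for scalar vectors, simply $|x_i|$); the function names are hypothetical.

```python
def best_s_term_error(entry_norms, s):
    """sigma_s(x)_1: sum of all but the s largest entry norms ||x_i||_V."""
    tail = sorted(entry_norms, reverse=True)[s:]
    return sum(tail)

def is_s_sparse(entry_norms, s):
    """True if at most s entries are nonzero."""
    return sum(1 for v in entry_norms if v != 0) <= s

# Entry norms 3, 1, 4, 1, 5: keeping the two largest (5 and 4) leaves 3 + 1 + 1
err = best_s_term_error([3.0, 1.0, 4.0, 1.0, 5.0], 2)
```

In particular, `best_s_term_error` returns zero exactly when the vector is $s$-sparse, which is the case in which the corresponding term in the error bounds below vanishes.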


Theorem 5 (Accuracy and Stability of $\ell^1$-Minimization) Let $\Phi = \{\phi_\iota : \iota \in
\mathcal{I}\} \subset L^2_\varrho(D)$ be a finite dictionary consisting of $n$ linearly independent functions,
with bounds $a, b > 0$ as in (45). Let $1 \le s \le n$, $0 < \alpha \le \beta < \infty$, $\{y_i\}_{i=1}^{m} \subset D$,
$w : D \to [0, \infty)$ be such that $w(y_i)$ is well defined for all $i$, and suppose that (44)
holds with $t = \min\{n, 2\lceil 4 s \frac{b \beta}{a \alpha} \rceil\}$. Let $f \in L^2_\varrho(D; V)$ with measurements (8) and
consider the problem (41) with X < pee, Then, any solution f of (41) satisfies


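\[
  \|f - \hat{f}\|_{L^2_\varrho(D;V)} \le c_1 \frac{\sigma_s(c)_{\ell^1([n];V)}}{\sqrt{s}}
  + c_2 \left( \|f - f_{\mathcal{I}}\|_{\mathrm{disc}} + \|f - \mathcal{P}_h(f)\|_{\mathrm{disc}} + \|e\|_{\ell^2([m];V)} \right),
\]

where $c = (c_\iota)_{\iota \in \mathcal{I}}$, $f_{\mathcal{I}} = \sum_{\iota \in \mathcal{I}} c_\iota \phi_\iota$ is the orthogonal projection (best approxi-
mation) of $f$ in $P_{\mathcal{I};V}$, $e = \frac{1}{\sqrt{m}} \big( \sqrt{w(y_i)}\, n_i \big)_{i=1}^{m} \in V^m$ and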


c= 8V/b, oa = 86 (+ =) + 


This result (see Sect. 4.7 for its proof) shows stable and accurate recovery for
the solution $\hat{f}$ of (41) (similar results can also be shown for (39) and (40); see
[10] and [5, Chpt. 6]). Specifically, the error is bounded by a multiple of (46) and
$\|f - f_{\mathcal{I}}\|_{\mathrm{disc}}$, which together measure how well $f$ can be represented by an $s$-
sparse representation in $\Phi$ (observe that these terms vanish when $f$ has an exact
$s$-sparse representation). The other terms are the space discretization error and the
noise error. As in the case of weighted least squares (see Remark 1), it is also
possible to replace the $\|\cdot\|_{\mathrm{disc}}$-norm by the $L^2_\varrho(D)$-norm when the sample points
are random variables. We also remark in passing that the factor 8 in the constants is
somewhat arbitrary. Other numerical values could also be used, subject to changing
the numerical values in the definition of $t$ and $\lambda$.

We next consider sample complexity. The following result is analogous to 
Theorem 3 for the case of compressed sensing. 


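Theorem 6 (Sample Complexity for (44)) Let $\Phi = \{\phi_\iota : \iota \in \mathcal{I}\} \subset L^2_\varrho(D)$ be a
finite dictionary consisting of $n$ linearly independent functions, with bounds $a, b >
0$ as in (45). Let $\mu$ be a probability measure satisfying Assumption 1, $1 \le t \le n$,
$0 < \delta < \delta^*$ for some universal constant $0 < \delta^* < 1$, $0 < \epsilon < 1$, $0 < \alpha \le \beta < \infty$
and $y_1, \ldots, y_m$ be independent with $y_i \sim \mu$ for $i = 1, \ldots, m$. Define

\[
  \Gamma = \Gamma(\Phi, w) = \max_{\iota \in \mathcal{I}} \big\| \sqrt{w}\, \phi_\iota \big\|_{L^\infty_\varrho(D)}, \tag{47}
\]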

where w is the weight function specified in (38), and suppose that 
wECS aI falas (log(en) - log” (e(I"?/a)t/5) + log(2/e)) ; 


for some universal constant $C > 0$. Then, (44) holds with $1 - \delta \le \alpha \le \beta \le 1 + \delta$,
with probability at least $1 - \epsilon$.

As in the least-squares case, this result reduces the question of sample complexity
to the matter of estimating a certain constant $\Gamma(\Phi, w)$ depending on the system $\Phi$
and the weight function $w$. Observe that $\Gamma^2 \ge a$ for any $\Phi$ and $w$. Indeed, $\Gamma^2 \ge
w(y) |\phi_\iota(y)|^2$ almost everywhere, and therefore (7) and (45) give
$\Gamma^2 \ge \|\phi_\iota\|^2_{L^2_\varrho(D)} \ge a$.


Combining this with Theorem 5, we deduce the following:


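Corollary 1 (Sample Complexity of $\ell^1$-Minimization) Let $\Phi = \{\phi_\iota : \iota \in \mathcal{I}\} \subset
L^2_\varrho(D)$ be a finite dictionary consisting of $n$ linearly independent functions, with
bounds $a, b > 0$ as in (45). Let $1 \le s \le n$, $0 < \epsilon < 1$, $\mu$ be a probability measure
satisfying Assumption 1, $y_1, \ldots, y_m$ be independent with $y_i \sim \mu$ for $i = 1, \ldots, m$
and $\Gamma$ be as in (47). Suppose that

\[
  m \gtrsim (b/a) \cdot (\Gamma^2/a) \cdot s \cdot \big( \log(e n) \cdot \log^2\big( e (b/a) (\Gamma^2/a) s \big) + \log(2/\epsilon) \big).
\]

Then, the following holds with probability at least $1 - \epsilon$. Let $f \in L^2_\varrho(D; V)$ with
measurements (8), and consider the problem (41) with $A$ and $b$ as in (42) and (43) and $\lambda =
c \sqrt{a/s}$ for some $0 < c \le C$, where $C > 0$ is a universal constant. Then, any
solution $\hat{f}$ of (41) satisfies

\[
  \|f - \hat{f}\|_{L^2_\varrho(D;V)} \lesssim \sqrt{b}\, \frac{\sigma_s(c)_{\ell^1([n];V)}}{\sqrt{s}}
  + \sqrt{b/a} \left( \|f - f_{\mathcal{I}}\|_{\mathrm{disc}} + \|f - \mathcal{P}_h(f)\|_{\mathrm{disc}} + \|e\|_{\ell^2([m];V)} \right).
\]

Proof Theorem 6 and the condition on $m$ show that (44) holds with $\delta = \delta^*/2$,
where $t = \min\{n, 2\lceil 4 s \frac{b \beta}{a \alpha} \rceil\}$. We now apply Theorem 5, noting that $c_1 \lesssim \sqrt{b}$
and $c_2 \lesssim (1/c + 1) \sqrt{b/a} + 1 \lesssim_c \sqrt{b/a}$ in this case. □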


Remark 6 (The Riesz Basis Constants $a$ and $b$) On closer inspection of the proofs,
it is evident that it is possible to somewhat relax the assumption (45) by requiring
only sparse subsets of $\Phi$ to form Riesz bases (with the same bounds). In Theorem 5,
for example, it is possible to replace (45) with the weaker condition

\[
  a \|c\|^2_{\ell^2(\mathcal{I})} \le \Big\| \sum_{\iota \in \mathcal{I}} c_\iota \phi_\iota \Big\|^2_{L^2_\varrho(D)} \le b \|c\|^2_{\ell^2(\mathcal{I})}, \qquad c = (c_\iota)_{\iota \in \mathcal{I}} \in \mathbb{C}^n \text{ is } t\text{-sparse}, \tag{48}
\]

where $t = \min\{n, 2\lceil 4 s \frac{b \beta}{a \alpha} \rceil\}$. Even when $\Phi$ forms a Riesz basis, when $s < n$, the
corresponding constants in (48) may be significantly better behaved than the Riesz
basis constants in (45). In Sect. 5, we see an example where the lower constant $a$
in (45) is extremely small, yet recovery is still possible from a reasonable number
of measurements. This suggests that it may be important to use (48) instead of (45)
in some scenarios.


4.3 Monte Carlo Sampling


We now discuss the case of Monte Carlo sampling, which corresponds to the choice
$\mu = \varrho$, i.e. $w(y) = 1$. Corollary 1 shows that the sample complexity of $\ell^1$-
minimization is determined by the constant $\Gamma$ defined in (47). In this case, we have

\[
  \Gamma = \max_{\iota \in \mathcal{I}} \|\phi_\iota\|_{L^\infty_\varrho(D)} =: \Theta = \Theta(\Phi),
\]

which leads to the sample complexity bound

\[
  m \gtrsim (b/a) \cdot (\Theta^2/a) \cdot s \cdot \big( \log(e n) \cdot \log^2\big( (b/a) e (\Theta^2/a) s \big) + \log(2/\epsilon) \big). \tag{49}
\]

It is worth comparing this bound with the least squares bound discussed in Sect. 3.3.
Let $S \subseteq \mathcal{I}$, $|S| \le s$ and $p \in P_S$ be arbitrary. Write $p = \sum_{\iota \in S} c_\iota \phi_\iota$. Then,

\[
  \|p\|_{L^\infty_\varrho(D)} \le \sum_{\iota \in S} |c_\iota| \|\phi_\iota\|_{L^\infty_\varrho(D)} \le \Theta \sqrt{s} \sqrt{ \sum_{\iota \in S} |c_\iota|^2 } \le \Theta \sqrt{s}\, \|p\|_{L^2_\varrho(D)} / \sqrt{a} .
\]

Hence, (18) gives that

\[
  (\mathcal{N}(P_S))^2 \le \Theta^2 s / a .
\]

Therefore, unsurprisingly, the sample complexity for $\ell^1$-minimization in the setting
of Problem 2 is always at least as large as least squares in the setting of Problem 1.

On the other hand, there are clearly instances where both sample complexities
are the same, at least up to log terms. Recall that the functions $\phi_\iota$ of Example 1
are equal to one in absolute value. Therefore, $\Theta = 1$ and, as discussed previously,
$(\mathcal{N}(P_S))^2 = s$. Since this is an orthonormal basis, we also have $a = b = 1$ in this
case. Hence, (49) reads

\[
  m \gtrsim s \cdot \big( \log(e n) \cdot \log^2(e s) + \log(2/\epsilon) \big).
\]


We conclude that Monte Carlo sampling in combination with least squares (in the
setting of Problem 1) or $\ell^1$-minimization (in the setting of Problem 2) is near optimal
for sparse approximation via trigonometric polynomials.

By contrast, in the case of Example 2, the size of $\Theta$ depends on the choice of the
finite index set $\mathcal{I}$. Using (9) and (25), we see that

\[
  \|\phi_\iota\|_{L^\infty_\varrho(D)} = \prod_{k=1}^{d} \sqrt{2 \iota_k + 1}, \qquad \iota = (\iota_k)_{k=1}^{d},
\]

and therefore

\[
  \Theta^2 = \max_{\iota \in \mathcal{I}} \Big\{ \prod_{k=1}^{d} (2 \iota_k + 1) \Big\}. \tag{50}
\]

Hence, if, for example,

\[
  \mathcal{I} = \mathcal{I}^{\mathrm{TP}}_{s}
\]

is the tensor product index set of order $s$ (see (10)), then it follows immediately that
$\Theta^2 = (2s + 1)^d$.


Thus, the sample complexity bound behaves like $s (2s + 1)^d$, up to log factors. This
grows exponentially with $d$ and always substantially exceeds $s$. Furthermore, since
$n = |\mathcal{I}| = (s + 1)^d$ in this case, this means that the sample complexity bound
actually exceeds $n$, a situation that is, naturally, undesirable.

This situation can be ameliorated by choosing a truncated set $\mathcal{I}$ with fewer high-
order polynomial indices, at the potential cost that important terms may be missed
in the truncation. For example, let

\[
  \mathcal{I} = \mathcal{I}^{\mathrm{HC}}_{s-1}
\]

be the hyperbolic cross index set of order $s - 1$ (recall (12)). Then, (50) gives

\[
  \Theta^2 \le 2^d \max_{\iota \in \mathcal{I}} \Big\{ \prod_{k=1}^{d} (\iota_k + 1) \Big\} \le 2^d s .
\]

Hence, the sample complexity behaves like $2^d s^2$, up to log terms. In other words, it is
substantially better than in the case of the tensor product index set, but still
exponentially large in $d$. Note that this bound is well suited when $d$ is comparatively
small in relation to $s$. In the setting where $d$ is large, one can also show that


glo8G)/ logs) 73 < oe? = gl0g3)/log(2) | (2e= 24. 


see [25, Lem. 3.5]. Thus, for large $d$, the sample complexity bound scales like
$s^{\log(3)/\log(2) + 1} \approx s^{2.58}$, up to log terms. In other words, it is polynomial in $s$,
independently of $d$, albeit with a scaling that is substantially bigger than the optimal
linear in $s$ scaling.
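
The growth of $\Theta^2$ in (50) can be checked by brute force for small $d$. The sketch below is illustrative only and rests on two assumptions about definitions not reproduced in this section: the tensor product set of order $s$ is taken as $\{\iota : \max_k \iota_k \le s\}$ (consistent with $n = (s+1)^d$ above), and the hyperbolic cross set of order $t$ as $\{\iota : \prod_k (\iota_k + 1) \le t + 1\}$. All function names are hypothetical.

```python
from itertools import product
from math import prod

def tensor_product_set(d, order):
    """All multi-indices in {0, ..., order}^d (assumed form of (10))."""
    return list(product(range(order + 1), repeat=d))

def hyperbolic_cross_set(d, order):
    """Multi-indices with prod(i_k + 1) <= order + 1 (assumed form of (12))."""
    return [nu for nu in product(range(order + 1), repeat=d)
            if prod(k + 1 for k in nu) <= order + 1]

def theta_squared(index_set):
    """Theta^2 = max over the index set of prod(2 i_k + 1), as in (50)."""
    return max(prod(2 * k + 1 for k in nu) for nu in index_set)

d, s = 2, 4
tp = tensor_product_set(d, s - 1)      # order s - 1, so n = s^d
hc = hyperbolic_cross_set(d, s - 1)    # order s - 1, as in the text
print(len(tp), theta_squared(tp))      # Theta^2 = (2(s-1)+1)^d for the tensor product set
print(len(hc), theta_squared(hc), 2 ** d * s)  # Theta^2 stays below 2^d s
```

Enumerating the full grid and filtering is exponential in $d$, so this is suitable only for small experiments; it is meant to make the two bounds concrete rather than to be an efficient index-set generator.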


4.4 ‘Optimal’ Sampling 


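With this in mind, we now consider how to choose the sampling measure to obtain
a smaller sample complexity. Following ideas of [50], our aim is to minimize the
quantity $\Gamma$ defined in (47). This is achieved by setting

\[
  (w(y))^{-1} = \frac{\max_{\iota \in \mathcal{I}} |\phi_\iota(y)|^2}{\theta^2}, \qquad
  \theta = \theta(\Phi) := \Big( \int_D \max_{\iota \in \mathcal{I}} |\phi_\iota(y)|^2 \,\mathrm{d}\varrho(y) \Big)^{1/2}. \tag{51}
\]

Notice that $\Gamma = \theta$ for this choice of $w$, which gives the sample complexity bound

\[
  m \gtrsim (b/a) \cdot (\theta^2/a) \cdot s \cdot \big( \log(e n) \cdot \log^2\big( e (\theta^2/a) (b/a) s \big) + \log(2/\epsilon) \big), \tag{52}
\]

provided $\mu$ satisfies (38), i.e.

\[
  \mathrm{d}\mu(y) = \frac{\max_{\iota \in \mathcal{I}} |\phi_\iota(y)|^2}{\theta^2} \,\mathrm{d}\varrho(y), \qquad y \in \mathrm{supp}(\varrho). \tag{53}
\]

Observe that the constant $\theta$ is always no larger than the constant $\Theta$ that appears in
the Monte Carlo sampling estimate (49). Hence, we expect this choice of measures
to be no worse than Monte Carlo sampling.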


Remark 7 (The Gap Between Problems 1 and 2) In the setting of Problem 1,
we obtained a sampling measure in Sect. 3.4 that leads to near-optimal sample
complexity, scaling linearly in $s$ for any $\Phi$ and subset $S$ of size $|S| = s$. Critically,
this measure depended on the known target set $S$. Conversely, in the setting of
Problem 2, the measure defined above does not, in general, lead to near-optimal
sample complexity. Indeed, it is not generally the case that $\theta = 1$. This constitutes a
key gap between the two settings. That it exists should come as little surprise. The
sampling measure used in the former setting depends completely on the target set.
Yet, the whole purpose of the latter setting is to compute sparse approximations in
the absence of this assumption. Hence, it is not unexpected that the sampling measure
defined above (which depends on $\mathcal{I}$ but not $S$) is not generally optimal.


Remark 8 (Arbitrarily Large Improvements Are Possible) On the other hand, there
are cases where the above sampling measure leads to substantial theoretical
improvements. For example, let $D = [0, 1]$, $\mathrm{d}\varrho(y) = \mathrm{d}y$ and

\[
  \phi_\iota(y) = y^{-\alpha} \exp(2 \pi \mathrm{i} \iota y)
\]

be the trigonometric polynomials scaled by a weight factor $y^{-\alpha}$ for some $0 < \alpha <
1/2$. Notice that $\{\phi_\iota : \iota \in \mathcal{I}\} \subset L^2_\varrho(D)$ forms a Riesz basis for any finite $\mathcal{I}$.
Clearly, in this case, one has $\Theta = \infty$. On the other hand,

\[
  \theta^2 = \int_0^1 y^{-2\alpha} \,\mathrm{d}y = \frac{1}{1 - 2\alpha}
\]

is bounded, for any choice of $\mathcal{I}$. The reason for this is that the functions $\phi_\iota$ are all
singular, yet their singularity occurs at the same place $y = 0$. The measure $\mu$, which
has the form

\[
  \mathrm{d}\mu(y) = \frac{\max_{\iota \in \mathcal{I}} |\phi_\iota(y)|^2}{\theta^2} \,\mathrm{d}\varrho(y) = (1 - 2\alpha)\, y^{-2\alpha} \,\mathrm{d}y,
\]

samples more densely near $y = 0$, thereby capturing the common singularity more
efficiently. Note that a similar scenario occurs in the setting of algebraic polynomial
approximation on the real line via Hermite polynomials; see [50, 58].
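
For the example in Remark 8, the measure $\mu$ has an explicit cumulative distribution function, $F(y) = y^{1 - 2\alpha}$ on $[0, 1]$, so it can be sampled by inversion. The following is a minimal sketch under that observation; the function names are hypothetical, and the weight is $w(y) = \theta^2 / \max_\iota |\phi_\iota(y)|^2 = y^{2\alpha} / (1 - 2\alpha)$ as implied by (51).

```python
import random

def theta_sq(alpha):
    """theta^2 = 1 / (1 - 2 alpha) for 0 < alpha < 1/2."""
    return 1.0 / (1.0 - 2.0 * alpha)

def sample_mu(alpha, m, rng):
    """Draw m points from d mu(y) = (1 - 2 alpha) y^(-2 alpha) dy on (0, 1].

    Inversion: F(y) = y^(1 - 2 alpha), so y = u^(1 / (1 - 2 alpha)), u uniform in (0, 1].
    """
    expo = 1.0 / (1.0 - 2.0 * alpha)
    return [(1.0 - rng.random()) ** expo for _ in range(m)]

def weight(y, alpha):
    """w(y) = theta^2 * y^(2 alpha), the reciprocal of the density in (51)."""
    return theta_sq(alpha) * y ** (2.0 * alpha)

rng = random.Random(0)              # fixed seed only for reproducibility
points = sample_mu(0.25, 1000, rng)
```

For $\alpha = 1/4$ the map is simply $y = u^2$, so half of the samples fall in $[0, 1/4]$, which illustrates the clustering near the common singularity at $y = 0$.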


4.5 ‘Optimal’ Sampling and Discrete Measures 


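As in the context of least squares, drawing samples from the measure (53) can be
challenging. Fortunately, we can overcome this issue in the same way by introducing
a finite grid. Let $Z = \{z_i\}_{i=1}^{k}$ be such a grid and

\[
  \tau = \frac{1}{k} \sum_{i=1}^{k} \delta_{z_i} \tag{54}
\]

be the discrete uniform measure supported on it. We then define the corresponding
discrete measure $\mu$ as

\[
  \mathrm{d}\mu(y) = \frac{\max_{\iota \in \mathcal{I}} |\phi_\iota(y)|^2}{\theta^2} \,\mathrm{d}\tau(y),
\]

where

\[
  \theta^2 = \int_D \max_{\iota \in \mathcal{I}} |\phi_\iota(y)|^2 \,\mathrm{d}\tau(y) = \frac{1}{k} \sum_{i=1}^{k} \max_{\iota \in \mathcal{I}} |\phi_\iota(z_i)|^2 . \tag{55}
\]

In other words,

\[
  \mu = \sum_{i=1}^{k} \frac{\max_{\iota \in \mathcal{I}} |\phi_\iota(z_i)|^2}{\sum_{l=1}^{k} \max_{\iota \in \mathcal{I}} |\phi_\iota(z_l)|^2} \, \delta_{z_i} . \tag{56}
\]

Sampling from this measure is achieved as follows. Define the matrix

\[
  B = \big( \phi_{\iota_j}(z_i) / \sqrt{k} \big)_{i \in [k],\, j \in [n]} \in \mathbb{C}^{k \times n}
\]

and notice that

\[
  \theta^2 = \sum_{i=1}^{k} \max_{j \in [n]} |B_{ij}|^2 .
\]

Hence, $y \sim \mu$ if

\[
  \mathbb{P}(y = z_i) = \frac{\max_{j \in [n]} |B_{ij}|^2}{\sum_{l=1}^{k} \max_{j \in [n]} |B_{lj}|^2}, \qquad i \in [k].
\]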
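
The sampling recipe above amounts to a weighted draw from the grid. The sketch below is a minimal illustration, not the authors' implementation: it assumes the dictionary has already been evaluated on the grid as a nested list `phi_on_grid[i][j]` $= \phi_{\iota_j}(z_i)$ (a hypothetical input format) and that the row maxima are nonzero, which holds whenever the dictionary contains a nonvanishing function such as a constant.

```python
import random

def optimal_discrete_measure(phi_on_grid):
    """Return (probs, weights, theta_sq) for the discrete measure (56).

    probs[i]   = P(y = z_i) = max_j |phi_j(z_i)|^2 / sum_l max_j |phi_j(z_l)|^2
    weights[i] = w(z_i)     = theta^2 / max_j |phi_j(z_i)|^2
    theta_sq   = (1/k) sum_i max_j |phi_j(z_i)|^2, as in (55)
    """
    k = len(phi_on_grid)
    row_max = [max(abs(v) ** 2 for v in row) for row in phi_on_grid]
    total = sum(row_max)
    theta_sq = total / k
    probs = [r / total for r in row_max]
    weights = [theta_sq / r for r in row_max]
    return probs, weights, theta_sq

def draw_samples(grid, probs, m, rng):
    """Draw m i.i.d. grid points according to probs (with replacement)."""
    return rng.choices(grid, weights=probs, k=m)

# Tiny hypothetical example: k = 3 grid points, n = 2 dictionary functions
grid = [0.1, 0.5, 0.9]
phi = [[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
probs, weights, theta_sq = optimal_discrete_measure(phi)
samples = draw_samples(grid, probs, 5, random.Random(0))
```

A useful sanity check is that `weights[i] * probs[i]` equals $1/k$ for every $i$, which is the discrete form of the relation $\mathrm{d}\mu = w^{-1} \mathrm{d}\tau$ in (38).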


As in the least-squares setting, sampling with respect to this measure, following the
sample complexity bound (52) (with $\theta$ as in (55)), is sufficient to ensure an estimate
with respect to the discrete measure $\tau$. From this, one can also obtain an error bound
with respect to the original measure $\varrho$, whenever the grid is sufficiently fine. Indeed,
suppose that

\[
  \alpha' \|p\|^2_{L^2_\varrho(D)} \le \|p\|^2_{L^2_\tau(D)} \le \beta' \|p\|^2_{L^2_\varrho(D)}, \qquad \forall p \in P_{\mathcal{I}} \tag{57}
\]

for constants $\beta' \ge \alpha' > 0$ (note that this is analogous to (32), the difference
being that we now require it to hold over $P_{\mathcal{I}}$). Then, (44) holds with respect to
the $\varrho$ measure with constants $\alpha' \alpha$ and $\beta' \beta$ whenever it holds with respect to the $\tau$
measure. The key point is that the grid is required to satisfy essentially the same
condition (57); in other words, it should give rise to a discrete norm on $P_{\mathcal{I}}$. One
can construct such a grid exactly as in Remark 4.


4.6 Further Discussion and Numerical Examples 


We now numerically examine Question 3 in the context of Examples 1 and 2 (we
discuss Example 3 further in Sect. 5). Since the complex exponentials have absolute
value equal to one, Example 1 is an instance where $\theta = \Theta = 1$ and where $\mu = \varrho$. In
other words, much as in the setting of Problem 1, sparse approximation in the case
of Problem 2 using trigonometric polynomials is possible via Monte Carlo sampling
with a number of samples that is proportional to $s$, multiplied by several log terms.

We next consider Example 2. Here, we recall from (50) that the relevant constant
$\Theta^2$ for Monte Carlo sampling can become arbitrarily large depending on the choice
of $\mathcal{I}$. By contrast, we now show that this situation cannot occur when sampling
according to the measure (53).


Proposition 1 Consider Example 2. Let $\mathcal{I}$ be a finite index set and $\Phi = \{\phi_\iota : \iota \in
\mathcal{I}\}$, where the $\phi_\iota$ are as in (9). Then, the constant $\theta = \theta(\Phi)$ defined by (51) satisfies

\[
  \theta^2 \le 2^d .
\]

In particular, when sampling from the corresponding measure (53), the sample com-
plexity estimate (52) is implied by

\[
  m \gtrsim 2^d \cdot s \cdot \big( \log(e n) \cdot (d + \log(e s))^2 + \log(2/\epsilon) \big). \tag{58}
\]

Proof The univariate Legendre polynomials $\phi_\iota$ satisfy the envelope bound

\[
  |\phi_\iota(y)| \le \frac{2}{\sqrt{\pi}\, (1 - y^2)^{1/4}}, \qquad -1 < y < 1,\ \iota \in \mathbb{N}_0; \tag{59}
\]

see, for example, [2, Eqn. (5.3)]. Observe that $\int_{-1}^{1} \frac{4}{\pi \sqrt{1 - y^2}} \frac{\mathrm{d}y}{2} = 2$. The result now
follows by taking tensor products. □


This result states that the sample complexity for the 'optimal' measure is linear
in $s$, up to a constant that scales at worst like $2^d$. This is a marked improvement over
the sample complexity bounds for Monte Carlo sampling. Moreover, as we see in
the examples below, the constant $\theta^2$ can be substantially smaller than $2^d$ for certain
choices of index set $\mathcal{I}$.


Remark 9 (The Preconditioning Scheme) The envelope bound (59) suggests an
alternative strategy for choosing $w$, based on the choice

\[
  w(y) = \prod_{k=1}^{d} (\pi/2) (1 - y_k^2)^{1/2} .
\]

This is sometimes termed the preconditioning technique for sparse approximation
with Legendre polynomials [58, 79]. The corresponding measure $\mu$ is precisely the
arcsine (Chebyshev) measure. Because of (59), it leads to the same sufficient sample
complexity bound (58) as sampling via the 'optimal' measure (51).
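
The envelope bound (59) and its consequence for the preconditioning weight can be checked numerically. The sketch below assumes, as the bound $\|\phi_\iota\|_{L^\infty} = \sqrt{2\iota + 1}$ used earlier suggests, that the univariate functions are $\phi_n = \sqrt{2n+1}\, P_n$ with $P_n$ the classical Legendre polynomials; the function names are hypothetical.

```python
import math

def normalized_legendre(n_max, y):
    """Return [phi_0(y), ..., phi_{n_max}(y)] with phi_n = sqrt(2n + 1) P_n(y)."""
    p_prev, p_curr = 1.0, y          # P_0 and P_1
    values = [1.0]
    for n in range(1, n_max + 1):
        values.append(math.sqrt(2 * n + 1) * p_curr)
        # Bonnet recurrence: (n + 1) P_{n+1} = (2n + 1) y P_n - n P_{n-1}
        p_prev, p_curr = p_curr, ((2 * n + 1) * y * p_curr - n * p_prev) / (n + 1)
    return values

def envelope(y):
    """Right-hand side of (59): 2 / (sqrt(pi) (1 - y^2)^(1/4))."""
    return 2.0 / (math.sqrt(math.pi) * (1.0 - y * y) ** 0.25)

def precond_weight(y):
    """w(y) = prod_k (pi/2) sqrt(1 - y_k^2) for a point y in (-1, 1)^d."""
    return math.prod(0.5 * math.pi * math.sqrt(1.0 - t * t) for t in y)

# Largest value of w(y) |phi_n(y)|^2 over a grid in d = 1; by (59) it is at most 2
grid = [-0.999 + 0.001 * i for i in range(1999)]
worst = max(precond_weight((y,)) * v * v
            for y in grid for v in normalized_legendre(30, y))
```

In $d$ dimensions the same computation applied coordinatewise gives $w(y) |\phi_\iota(y)|^2 \le 2^d$, which is why the preconditioned scheme inherits the bound (58).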


We now explore this effect numerically in the setting of Example 2. In order to
avoid the difficulties of sampling from the continuous measure (53), we instead use
the discrete measure (56) throughout. The fine grid consists of $k = 10n$ Monte Carlo
points (recall Remark 4).

Fig. 4 The constants $\Theta^2$ (top row), given by (50), and $\theta^2$ (bottom row), given by (55), against
$n = |\mathcal{I}|$ in the case of Example 2 for different choices of $\mathcal{I}$ and dimensions $d$. Panels: tensor
product, $\mathcal{I} = \mathcal{I}^{\mathrm{TP}}$; total degree, $\mathcal{I} = \mathcal{I}^{\mathrm{TD}}$; hyperbolic cross, $\mathcal{I} = \mathcal{I}^{\mathrm{HC}}$

In Fig. 4, we plot the constants $\Theta^2$ and $\theta^2$ for several different choices of $\mathcal{I}$. We
notice several key effects. First, the constant $\theta^2$ is small: in fact, no greater than
$\approx 25$ in all cases. It is much smaller than the bound $2^d$ shown above (notice that
$2^{16} = 65{,}536$), which appears to be very pessimistic in practice. We also observe
that $\theta^2$ is several times smaller than $\Theta^2$, suggesting better sample complexity
when sampling from (56) instead of Monte Carlo sampling. On the other hand,
the difference between the two quantities lessens in higher dimensions. This is not
surprising, since the bad scaling of $\Theta$ is caused by the presence of high polynomial
indices. For fixed maximum size $|\mathcal{I}| = n$, the index set $\mathcal{I}$ contains fewer higher
order polynomials in higher dimensions than in lower dimensions. This suggests
that Monte Carlo sampling may become more acceptable in higher dimensions. We
show this effect in more detail next.

In Figs. 5, 6, 7, we consider function approximation using the two sampling
strategies. As we see, the 'optimal' strategy gives a nonnegligible benefit over Monte
Carlo sampling in lower dimensions for $f_1$ and $f_2$. Yet, in higher dimensions, this
benefit lessens. On the other hand, for $f = f_3$, the 'optimal' strategy yields no
better performance, and actually a larger error than Monte Carlo sampling in high
dimensions. The lessening benefit with increasing dimension is consistent with the
results of Fig. 4, wherein it is shown that the difference between the constants
$\Theta^2$ and $\theta^2$ decreases as $d$ increases. On the other hand, the observation that it
can sometimes yield worse approximations is an important reminder that the sampling
strategy is designed to enhance the performance of sparse approximation in general
and may therefore not be the best strategy for any fixed function.

Fig. 5 The relative error (35) versus $m$ for $\ell^1$-minimization in the case of Example 2, where
$\mathcal{I} = \mathcal{I}^{\mathrm{HC}}_t$ is the hyperbolic cross index set (12) and $f = f_1$ is as in (33). This figure compares
Monte Carlo uniform sampling (labelled 'LU') and sampling from the discrete 'optimal' measure
(56) (labelled 'LO'). Panels include $(d, t, n) = (8, 22, 1843)$ and $(d, t, n) = (16, 14, 4385)$


Remark 10 To elaborate on this previous comment, notice that the 'optimal'
measure samples more densely near the boundary of the hypercube $[-1, 1]^d$, where
the Legendre polynomials are larger, and therefore less densely near the origin.
Hence, any function that varies most significantly in the interior of the domain is
liable to be less well approximated by sampling from the 'optimal' measure. This is
the case in particular for the function $f = f_3$, which is a product of one-dimensional
functions that are peaked around centres that get progressively closer to the origin
with increasing index $i$ and are relatively flat away from their centre.

Fig. 6 The same as Fig. 5 but with $f = f_2$ as in (33)


4.7 Proof of Theorems 5 and 6 


Throughout this section, if $x = (x_i)_{i=1}^{n} \in V^n$ and $S \subseteq [n]$, we use the notation
$x_S \in V^n$ to denote the vector with $i$th entry equal to $x_i$ if $i \in S$ and $0$ otherwise.
Note that $x_S$ is isomorphic to a vector in $V^{|S|}$. We will sometimes consider it as an
element of this space. We now recall the following definition and lemma, which can
be found in [11, Defn. 6 & Lem. 7]:


Definition 2 A matrix $A \in \mathbb{C}^{m \times n}$ satisfies the robust Null Space Property (rNSP)
of order $1 \le s \le n$ over $V^n$ with constants $0 < \rho < 1$ and $\tau > 0$ if

\[
  \|x_S\|_{\ell^2([n];V)} \le \frac{\rho \|x_{S^c}\|_{\ell^1([n];V)}}{\sqrt{s}} + \tau \|A x\|_{\ell^2([m];V)}, \qquad \forall x \in V^n,
\]

for any $S \subseteq [n]$ with $|S| \le s$.


Here, in the second term, we recall that a matrix $A \in \mathbb{C}^{m \times n}$ extends in the
obvious way to a mapping $V^n \to V^m$.

Fig. 7 The same as Fig. 5 but with $f = f_3$ as in (33)


Lemma 3 Suppose that $A \in \mathbb{C}^{m \times n}$ has the rNSP of order $1 \le s \le n$ with constants
$0 < \rho < 1$ and $\tau > 0$. Let $x \in V^n$, $v = A x + e \in V^m$ and

\[
  \lambda \le \frac{C_1}{C_2 \sqrt{s}},
\]
where C, = Sept and C2 eS Then, every minimizer x € V" of the 


Hilbert-valued SR-LASSO problem

\[
  \min_{z \in V^n} \lambda \|z\|_{\ell^1([n];V)} + \|A z - v\|_{\ell^2([m];V)}
\]

satisfies

\[
  \|\hat{x} - x\|_{\ell^2([n];V)} \le 2 C_1 \frac{\sigma_s(x)_{\ell^1([n];V)}}{\sqrt{s}} + \Big( \frac{C_1}{\sqrt{s} \lambda} + C_2 \Big) \|e\|_{\ell^2([m];V)} .
\]

Proof of Theorem 5 We first claim that the matrix $A$ defined in (42) satisfies

\[
  a \alpha \|c\|^2_{\ell^2([n];V)} \le \|A c\|^2_{\ell^2([m];V)} \le b \beta \|c\|^2_{\ell^2([n];V)} \tag{60}
\]

for all vectors $c \in V^n$ that are $t$-sparse. Observe that any such vector $c$ corresponds
to the coefficients of an element $p \in P_{T;V}$ for $T \subseteq \mathcal{I}$, $|T| \le t$, and also that
$\|A c\|^2_{\ell^2([m];V)} = \frac{1}{m} \sum_{i=1}^{m} w(y_i) \|p(y_i)\|_V^2$. We now recall from Lemma 2 that (44)
also holds over $P_{T;V}$ whenever it holds over $P_T = P_{T;\mathbb{C}}$. Hence,

\[
  \alpha \|p\|^2_{L^2_\varrho(D;V)} \le \|A c\|^2_{\ell^2([m];V)} \le \beta \|p\|^2_{L^2_\varrho(D;V)} . \tag{61}
\]

Now, using almost identical arguments to those used in the proof of Lemma 2, we 
find that the Riesz basis condition (45) also extends to the V-valued case: 


2 


2 2 
allcliacg.y S|, ob <biiclaggy Vo= Ces EV". (62) 
ef N112(D;V) 


Since p = > <g Ch, the claim now follows from this and (61). 

We now show that $A$ satisfies the rNSP of order $s$ over $V^n$ and derive values
for the constants $\rho$ and $\tau$. The following argument is based on [5, Lem. 13.8]. Let
$c \in V^n$ and $S \subseteq [n]$ with $|S| \le s$. Suppose first that $t < n$, so that $t = 2 \lceil 4 s \frac{b \beta}{a \alpha} \rceil$
(we consider the case $t = n$ later). Define a partition $\Delta_1, \Delta_2, \ldots$ of $S^c$ as follows.
First, let $\Delta_1$ be the index set of the $t' = t/2$ largest (notice that $t'$ is an integer,
due to the definition of $t$) entries of the vector $(\|c_j\|_V)_{j \in S^c}$, $\Delta_2$ be the index set of
the $t'$ largest entries of the vector $(\|c_j\|_V)_{j \in (S \cup \Delta_1)^c}$ and so forth. This gives
a partition of $S^c$ for which each set is of size $t'$, except possibly the final set (this is
of no consequence to the argument). Consider the set $S \cup \Delta_1 \supseteq S$. Since $s \le t'$, we
have $|S \cup \Delta_1| \le s + t' \le 2t' = t$. Hence, we may apply (60) to obtain

$$\|P_S c\|^2_{\ell^2([n];V)} \le \|P_{S \cup \Delta_1} c\|^2_{\ell^2([n];V)} \le \frac{1}{a \alpha} \|A P_{S \cup \Delta_1} c\|^2_{\ell^2([m];V)}. \qquad (63)$$
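We now write

$$\|A P_{S \cup \Delta_1} c\|^2_{\ell^2([m];V)} = \langle A P_{S \cup \Delta_1} c, A c \rangle_{\ell^2([m];V)} - \sum_{i \ge 2} \langle A P_{S \cup \Delta_1} c, A P_{\Delta_i} c \rangle_{\ell^2([m];V)}$$

and then apply the Cauchy–Schwarz inequality and (60) once more to get

$$\|A P_{S \cup \Delta_1} c\|_{\ell^2([m];V)} \le \|A c\|_{\ell^2([m];V)} + \sqrt{b \beta} \sum_{i \ge 2} \|P_{\Delta_i} c\|_{\ell^2([n];V)}.$$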
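Notice that, by construction of $\Delta_i$ for each $i \ge 2$, we have

$$\|P_{\Delta_i} c\|_{\ell^2([n];V)} \le \sqrt{t'} \max_{j \in \Delta_i} \|c_j\|_V \le \sqrt{t'} \min_{j \in \Delta_{i-1}} \|c_j\|_V.$$

Since $\|P_{\Delta_{i-1}} c\|_{\ell^1([n];V)} \ge t' \min_{j \in \Delta_{i-1}} \|c_j\|_V$, this implies the following:

$$\|P_{\Delta_i} c\|_{\ell^2([n];V)} \le \frac{1}{\sqrt{t'}} \|P_{\Delta_{i-1}} c\|_{\ell^1([n];V)}, \quad i \ge 2.$$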
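Using this and the previous expression, we deduce that

$$\|A P_{S \cup \Delta_1} c\|_{\ell^2([m];V)} \le \|A c\|_{\ell^2([m];V)} + \sqrt{\frac{b \beta}{t'}} \sum_{i \ge 1} \|P_{\Delta_i} c\|_{\ell^1([n];V)} = \|A c\|_{\ell^2([m];V)} + \sqrt{\frac{b \beta}{t'}} \|P_{S^c} c\|_{\ell^1([n];V)}.$$

Substituting this back into (63) and noticing that $\sqrt{b \beta / (a \alpha t')} \le 1/(2 \sqrt{s})$, we get

$$\|P_S c\|_{\ell^2([n];V)} \le \frac{1}{\sqrt{a \alpha}} \|A c\|_{\ell^2([m];V)} + \frac{1}{2 \sqrt{s}} \|P_{S^c} c\|_{\ell^1([n];V)}, \qquad (64)$$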


for the case $t < n$. Now suppose that $t = n$. Since (60) now holds for any vector
$c \in V^n$, we easily see that

$$\|P_S c\|_{\ell^2([n];V)} \le \frac{1}{\sqrt{a \alpha}} \|A c\|_{\ell^2([m];V)} \le \frac{1}{\sqrt{a \alpha}} \|A c\|_{\ell^2([m];V)} + \frac{1}{2 \sqrt{s}} \|P_{S^c} c\|_{\ell^1([n];V)}.$$

Hence, (64) also holds in this case. We deduce that $A$ satisfies the rNSP of
order $s$ over $V^n$ with constants $\rho = 1/2$ and $\tau = 1/\sqrt{a \alpha}$.
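The block construction used in this argument can be sketched in Python. The sketch takes the entry norms $\|c_j\|_V$ as a plain list, builds the blocks $\Delta_1, \Delta_2, \ldots$ of size $t'$ from the complement of $S$ in order of decreasing norm, and checks the inequality $\|P_{\Delta_i} c\|_{\ell^2} \le \|P_{\Delta_{i-1}} c\|_{\ell^1} / \sqrt{t'}$ for $i \ge 2$. The function names and the 0-based indexing are assumptions made for illustration.

```python
import math

def block_partition(norms, S, t_prime):
    """Split the complement of S into blocks of size t_prime, taking indices in
    order of decreasing norm (the final block may be smaller)."""
    rest = sorted((j for j in range(len(norms)) if j not in S), key=lambda j: -norms[j])
    return [rest[k:k + t_prime] for k in range(0, len(rest), t_prime)]

def check_block_bound(norms, blocks, t_prime):
    """Check ||P_{Delta_i} c||_2 <= ||P_{Delta_{i-1}} c||_1 / sqrt(t_prime) for i >= 2."""
    for prev, cur in zip(blocks, blocks[1:]):
        l2_cur = math.sqrt(sum(norms[j] ** 2 for j in cur))
        l1_prev = sum(norms[j] for j in prev)
        if l2_cur > l1_prev / math.sqrt(t_prime) + 1e-12:
            return False
    return True

# Hypothetical entry norms ||c_j||_V, with S = {0} and t' = 2.
norms = [5.0, 1.0, 4.0, 2.0, 3.0, 0.5, 2.5]
blocks = block_partition(norms, {0}, 2)
```

The bound holds for any input because every entry of a block is no larger than every entry of the block before it, which is exactly the property the sorting step enforces.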

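Having shown this, we complete the proof by establishing the error bounds for
$\hat{f}$. Let $\hat{c}$ be the coefficients of $\hat{f}$. By the triangle inequality, we have

$$\|f - \hat{f}\|_{L^2_\varrho(D;V)} \le \|f - \mathcal{P}_h(f)\|_{L^2_\varrho(D;V)} + \|\mathcal{P}_h(f) - \mathcal{P}_h(f_{\mathcal{F}})\|_{L^2_\varrho(D;V)} + \|\mathcal{P}_h(f_{\mathcal{F}}) - \hat{f}\|_{L^2_\varrho(D;V)}.$$

Therefore, since $\mathcal{P}_h$ is an orthogonal projection, we have

$$\|f - \hat{f}\|_{L^2_\varrho(D;V)} \le \|f - \mathcal{P}_h(f)\|_{L^2_\varrho(D;V)} + \|f - f_{\mathcal{F}}\|_{L^2_\varrho(D;V)} + \|\mathcal{P}_h(f_{\mathcal{F}}) - \hat{f}\|_{L^2_\varrho(D;V)}.$$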
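We first bound the term $\|\mathcal{P}_h(f_{\mathcal{F}}) - \hat{f}\|_{L^2_\varrho(D;V)}$. By (62), we have

$$\|\mathcal{P}_h(f_{\mathcal{F}}) - \hat{f}\|_{L^2_\varrho(D;V)} = \Big\| \sum_{\iota \in \mathcal{F}} (\mathcal{P}_h(c_\iota) - \hat{c}_\iota) \phi_\iota \Big\|_{L^2_\varrho(D;V)} \le \sqrt{b} \|\mathcal{P}_h(c) - \hat{c}\|_{\ell^2([n];V)},$$

where $\mathcal{P}_h(c)$ is the vector $(\mathcal{P}_h(c_\iota))_{\iota \in \mathcal{F}}$. Using Lemma 3, we get

$$\|\mathcal{P}_h(f_{\mathcal{F}}) - \hat{f}\|_{L^2_\varrho(D;V)} \le \sqrt{b} \left[ 2 C_1 \frac{\sigma_s(\mathcal{P}_h(c))_{\ell^1([n];V)}}{\sqrt{s}} + \left( \frac{C_1}{\sqrt{s} \lambda} + C_2 \right) \|A \mathcal{P}_h(c) - v\|_{\ell^2([m];V)} \right], \qquad (65)$$

where $C_1$ and $C_2$ are the constants of Lemma 3 corresponding to $\rho = 1/2$ and $\tau = 1/\sqrt{a \alpha}$.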
Since $\mathcal{P}_h$ is an orthogonal projection, we have

$$\sigma_s(\mathcal{P}_h(c))_{\ell^1([n];V)} \le \sigma_s(c)_{\ell^1([n];V)}. \qquad (66)$$


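Moreover, using (42) and (43), we see that

$$\|A \mathcal{P}_h(c) - v\|_{\ell^2([m];V)} = \Big\| \tfrac{1}{\sqrt{m}} \big( \sqrt{w(y_i)} ( \mathcal{P}_h(f_{\mathcal{F}})(y_i) - f(y_i) - n_i ) \big)_{i \in [m]} \Big\|_{\ell^2([m];V)}$$

$$\le \Big\| \tfrac{1}{\sqrt{m}} \big( \sqrt{w(y_i)} ( \mathcal{P}_h(f_{\mathcal{F}})(y_i) - \mathcal{P}_h(f)(y_i) ) \big)_{i \in [m]} \Big\|_{\ell^2([m];V)} + \Big\| \tfrac{1}{\sqrt{m}} \big( \sqrt{w(y_i)} ( \mathcal{P}_h(f)(y_i) - f(y_i) ) \big)_{i \in [m]} \Big\|_{\ell^2([m];V)} + \|e\|_{\ell^2([m];V)}.$$

Now, observe that

$$\big\| \sqrt{w(y_i)} ( \mathcal{P}_h(f_{\mathcal{F}})(y_i) - \mathcal{P}_h(f)(y_i) ) \big\|_V \le \sqrt{w(y_i)} \|f_{\mathcal{F}}(y_i) - f(y_i)\|_V.$$

We deduce that

$$\|A \mathcal{P}_h(c) - v\|_{\ell^2([m];V)} \le \|f - f_{\mathcal{F}}\|_{\mathrm{disc}} + \|f - \mathcal{P}_h(f)\|_{\mathrm{disc}} + \|e\|_{\ell^2([m];V)}.$$

Substituting this into (65) and then combining with (66) now completes the proof. □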
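We now prove Theorem 6. For this, we use the following result, which was shown
in [20, Thm. 1.1]:


Theorem 7 There exist absolute constants $\kappa, c_0, c_1 > 0$ such that the following
holds. Let $X_1, \ldots, X_m$ be independent copies of a random vector $X \in \mathbb{C}^n$ such that
$\|X\|_{\ell^\infty} \le K$ almost surely for some $K > 0$. Let $\mathcal{T} \subseteq \{c \in \mathbb{C}^n : \|c\|_{\ell^1} \le \sqrt{s}\}$,
$\delta \in (0, \kappa)$, $0 < \epsilon < 1$, and suppose that

$$m \ge c_0 \cdot K^2 \cdot \delta^{-2} \cdot s \cdot \left( \log(\mathrm{e} n) \log^2(s K^2 / \delta) + \log(2/\epsilon) \right).$$

Then, with probability at least $1 - \epsilon$, we have

$$\sup_{c \in \mathcal{T}} \left| \frac{1}{m} \sum_{i=1}^m |\langle X_i, c \rangle|^2 - \mathbb{E} |\langle X, c \rangle|^2 \right| \le c_1 \delta \left( 1 + \sup_{c \in \mathcal{T}} \mathbb{E} |\langle c, X \rangle|^2 \right).$$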

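Proof of Theorem 6 Note that the result holds, provided

$$\left| \frac{1}{m} \sum_{i=1}^m w(y_i) |p(y_i)|^2 - \|p\|^2_{L^2_\varrho(D)} \right| \le \delta, \quad \forall p \in \mathcal{P}_T,\ \|p\|_{L^2_\varrho(D)} \le 1,\ T \subseteq \mathcal{F},\ |T| \le t.$$

Define the random vector $X = (\sqrt{w(y)} \phi_{\iota_j}(y))_{j \in [n]}$, where $y \sim \mu$. Let $p \in \mathcal{P}_T$ for
some $T \subseteq \mathcal{F}$ with $|T| \le t$, and write $p = \sum_{\iota \in T} c_\iota \phi_\iota$. Then,

$$\frac{1}{m} \sum_{i=1}^m w(y_i) |p(y_i)|^2 = \frac{1}{m} \sum_{i=1}^m |\langle X_i, c \rangle|^2,$$

and, since $\mathrm{d}\mu(y) = (w(y))^{-1} \mathrm{d}\varrho(y)$,

$$\|p\|^2_{L^2_\varrho(D)} = \int_D |p(y)|^2 \, \mathrm{d}\varrho(y) = \int_D w(y) \Big| \sum_{\iota \in T} c_\iota \phi_\iota(y) \Big|^2 \mathrm{d}\mu(y) = \mathbb{E} |\langle X, c \rangle|^2.$$

Hence, it suffices to show that

$$\left| \frac{1}{m} \sum_{i=1}^m |\langle X_i, c \rangle|^2 - \mathbb{E} |\langle X, c \rangle|^2 \right| \le \delta, \quad \forall c \in \mathcal{T}, \qquad (67)$$

where

$$\mathcal{T} = \{ c \in \mathbb{C}^n : c \text{ is } t\text{-sparse and } \mathbb{E} |\langle X, c \rangle|^2 \le 1 \}.$$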
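Notice that if $c \in \mathcal{T}$, then, by (45),

$$\|c\|_{\ell^1} \le \sqrt{t} \|c\|_{\ell^2} \le \sqrt{t/a} \Big\| \sum_{\iota \in \mathcal{F}} c_\iota \phi_\iota \Big\|_{L^2_\varrho(D)} = \sqrt{t/a} \sqrt{\mathbb{E} |\langle X, c \rangle|^2} \le \sqrt{t/a}.$$

Hence, $\mathcal{T} \subseteq \{c \in \mathbb{C}^n : \|c\|_{\ell^1} \le \sqrt{t/a}\}$. Therefore, we may apply Theorem 7 to get
that

$$\sup_{c \in \mathcal{T}} \left| \frac{1}{m} \sum_{i=1}^m |\langle X_i, c \rangle|^2 - \mathbb{E} |\langle X, c \rangle|^2 \right| \le c_1 \delta' \left( 1 + \sup_{c \in \mathcal{T}} \mathbb{E} |\langle c, X \rangle|^2 \right) \le 2 c_1 \delta',$$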


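for $0 < \delta' < \kappa$. This holds with probability at least $1 - \epsilon$, provided

$$m \ge c_0 \cdot K^2 \cdot (\delta')^{-2} \cdot (t/a) \cdot \left( \log(\mathrm{e} n) \log^2(t K^2 / (a \delta')) + \log(2/\epsilon) \right).$$

Observe that $\|X\|_{\ell^\infty} = \max_{\iota \in \mathcal{F}} \{ \sqrt{w(y)} |\phi_\iota(y)| \}$, and therefore we may take
$K = \Gamma = \| \max_{\iota \in \mathcal{F}} \{ \sqrt{w(\cdot)} |\phi_\iota(\cdot)| \} \|_{L^\infty_\varrho(D)}$. We now set $\delta' = \delta / (2 c_1)$ to deduce that (67)
holds for $0 < \delta < \kappa / (2 c_1)$ with probability at least $1 - \epsilon$, provided

$$m \ge 4 c_0 c_1^2 \cdot \Gamma^2 \cdot \delta^{-2} \cdot (t/a) \cdot \left( \log(\mathrm{e} n) \log^2(2 c_1 t \Gamma^2 / (a \delta)) + \log(2/\epsilon) \right).$$

To complete the proof, we simply notice that $\log(2 c_1 t \Gamma^2 / (a \delta)) = \log(\mathrm{e} t \Gamma^2 / (a \delta)) + \log(2 c_1 / \mathrm{e}) \lesssim \log(\mathrm{e} t \Gamma^2 / (a \delta))$, since $t \Gamma^2 / (a \delta) \ge 1$. □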


5 A Novel Approach for Sparse Polynomial Approximation 
on Irregular Domains 


In this section, we focus on Example 3 in the context of $\ell^1$-minimization. As we
have seen in the previous section, the various measurement conditions depend on
the Riesz basis constants $a$ and $b$ of the system $\Phi = \{\phi_\iota : \iota \in \mathcal{F}\}$. Unfortunately, in
cases such as Example 3, these constants are often very poorly behaved, especially
when the polynomial degree is large. Later, in Figs. 11, 12, 13, we compute $a$, $b$ for
various different domains. While we note that these constants may be pessimistic (in
particular, recall Remark 6), the fact remains that dealing with poorly conditioned
dictionaries may well be problematic in the setting of $\ell^1$-minimization for sparse
approximation.
5.1 Method


Inspired by our earlier use of discrete measures to effect optimal sampling, in this
section, we consider a different approach for approximation on irregular domains
in which the original Riesz basis $\Phi$ is orthogonalized over the discrete grid.
Specifically, let $Z = \{z_i\}_{i=1}^K$ be a finite grid and $\tau$ be as in (54). Then, we construct a
new basis $\Upsilon = \{\upsilon_\iota : \iota \in \mathcal{F}\}$ that is orthonormal with respect to $\tau$ and subsequently
use this basis to construct the approximation $\hat{f}$ to $f$ via $\ell^1$-minimization.
As in Sect. 3.5, the orthonormal basis $\Upsilon$ is constructed via QR factorization. Let

$$B = \left( \phi_{\iota_j}(z_i) / \sqrt{K} \right)_{i \in [K],\, j \in [n]} \in \mathbb{C}^{K \times n},$$

and suppose that $B$ has QR factorization $B = QR$, where $Q \in \mathbb{C}^{K \times n}$ and $R \in \mathbb{C}^{n \times n}$.
Then, this orthonormal basis is given by

$$\upsilon_{\iota_i}(y) = \sum_{j=1}^{i} (R^{-\top})_{ij} \phi_{\iota_j}(y), \quad i \in [n]. \qquad (68)$$

In what follows, we compare two sampling strategies for the basis $\Upsilon$. The first is
Monte Carlo sampling from the underlying measure $\tau$, i.e.

$$\mathbb{P}(y = z_i) = \frac{1}{K}, \quad i \in [K]. \qquad (69)$$

The second one is the ‘optimal’ sampling measure identified in Sect. 4.5. In this
case, due to the orthogonalization, this is given by

$$\mathbb{P}(y = z_i) = \frac{\max_{j \in [n]} |Q_{ij}|^2}{\sum_{l \in [K]} \max_{j \in [n]} |Q_{lj}|^2}, \quad i \in [K]. \qquad (70)$$

As shown in the previous section, the sample complexity bounds for these two
strategies depend on their respective constants $\Theta$ and $\theta$. Because of the previous
definition of the basis, these are given by

$$\Theta = \Theta(\Upsilon) = \sqrt{K} \max_{i \in [K],\, j \in [n]} |Q_{ij}|$$

for the former, and

$$\theta = \theta(\Upsilon) = \sqrt{\sum_{i \in [K]} \max_{j \in [n]} |Q_{ij}|^2}$$

for the latter.
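The construction above can be sketched in pure Python as follows. This is an illustration under assumptions, not the implementation used for the experiments: the QR factorization is computed by a hand-written modified Gram-Schmidt routine (a stand-in for a library QR), the dictionary is the hypothetical choice of monomials $1, y, y^2$ on a nonuniform one-dimensional grid, all entries are real, and the function names are invented for the example.

```python
import math

def qr_mgs(B):
    """QR factorization B = Q R by modified Gram-Schmidt (B is K x n, full column rank)."""
    K, n = len(B), len(B[0])
    Q = [[0.0] * n for _ in range(K)]
    R = [[0.0] * n for _ in range(n)]
    cols = [[B[i][j] for i in range(K)] for j in range(n)]  # working copy of the columns
    for j in range(n):
        R[j][j] = math.sqrt(sum(v * v for v in cols[j]))
        q = [v / R[j][j] for v in cols[j]]
        for i in range(K):
            Q[i][j] = q[i]
        for l in range(j + 1, n):
            R[j][l] = sum(q[i] * cols[l][i] for i in range(K))
            cols[l] = [cols[l][i] - R[j][l] * q[i] for i in range(K)]
    return Q, R

def sampling_measures(Q):
    """Probability vectors on the grid: uniform as in (69), 'optimal' as in (70)."""
    K = len(Q)
    row_max = [max(q * q for q in row) for row in Q]
    total = sum(row_max)
    return [1.0 / K] * K, [r / total for r in row_max]

def constants(Q):
    """Return (Theta, theta) computed from the entries of Q."""
    K = len(Q)
    Theta = math.sqrt(K) * max(abs(q) for row in Q for q in row)
    theta = math.sqrt(sum(max(q * q for q in row) for row in Q))
    return Theta, theta

# Hypothetical example: monomials 1, y, y^2 on a nonuniform grid in [0, 1].
grid = [(i / 49) ** 2 for i in range(50)]
K, n = len(grid), 3
B = [[z ** j / math.sqrt(K) for j in range(n)] for z in grid]
Q, R = qr_mgs(B)
uniform, optimal = sampling_measures(Q)
Theta, theta = constants(Q)
```

Sample points could then be drawn from the second measure with the standard library call `random.choices(grid, weights=optimal, k=m)`. Note that $\theta \le \Theta$ always holds for these definitions, since each row maximum is bounded by the global maximum.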
5.2 Orderings


Suppose that $f$ has a sparse representation in the system $\Phi$. Then, it is not
guaranteed to possess a sparse representation in the basis $\Upsilon$, since the $j$th basis
function $\upsilon_{\iota_j}$ is a linear combination of the functions $\phi_{\iota_1}, \ldots, \phi_{\iota_j}$. In particular, the
sparsity of $f$ in the representation $\Upsilon$ will be heavily influenced by the ordering
$\{\iota_1, \ldots, \iota_n\}$ of the indices in $\mathcal{F}$.

To illustrate this, in Figs. 8, 9, 10, we compare ordering the multi-indices in $\mathcal{F}$
lexicographically to ordering them according to increasing total degree, i.e. the value
$\iota_1 + \ldots + \iota_d$ for $\iota = (\iota_k)_{k=1}^d$. We remark in passing that ordering according to the maximum,
i.e. the value $\max_{k=1,\ldots,d}\{\iota_k\}$, produces similar results. To examine the sparsity in
either basis, we plot the coefficients of a function $f$ sorted from largest to smallest
in absolute value. In particular, the more rapidly the coefficients decrease, the better
$f$ is approximated by a sparse representation in the given basis.

As is evident, lexicographic ordering always leads to a deterioration in sparsity
when switching from the basis $\Phi$ to the basis $\Upsilon$. This is of little surprise. On the
other hand, using the total degree ordering can substantially improve the situation.
In Fig. 8, it actually leads to better sparsity, and in Fig. 9 it yields better sparsity in
$d = 2$ dimensions and similar sparsity in $d = 8$ and $d = 16$ dimensions. Finally,
in Fig. 10, while still leading to worse sparsity than in the original basis $\Phi$, it is
generally better than lexicographic ordering.
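The orderings compared above are straightforward to generate. The following Python sketch is illustrative only: it assumes the hyperbolic cross consists of the multi-indices with $\prod_k (\iota_k + 1) \le s$ (the usual form, which may differ in detail from (12)), it breaks ties within a total degree lexicographically (an arbitrary choice), and the function names are hypothetical.

```python
from itertools import product

def hyperbolic_cross(d, s):
    """Multi-indices in N_0^d with prod(iota_k + 1) <= s (assumed form of the index set)."""
    out = []
    for iota in product(range(s), repeat=d):
        size = 1
        for k in iota:
            size *= k + 1
        if size <= s:
            out.append(iota)
    return out

def lexicographic(indices):
    """Order multi-indices lexicographically."""
    return sorted(indices)

def total_degree(indices):
    """Order by increasing total degree; ties are broken lexicographically."""
    return sorted(indices, key=lambda iota: (sum(iota), iota))

def max_degree(indices):
    """Order by increasing maximum entry; ties are broken lexicographically."""
    return sorted(indices, key=lambda iota: (max(iota), iota))
```

The ordering fixes which column of $B$ each basis function occupies before orthogonalization, which is why it affects the sparsity of the coefficients in the orthogonalized basis. The brute-force enumeration in `hyperbolic_cross` is only practical for small $d$ and $s$.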
Fig. 8 The absolute values of the coefficients of the function $f = f_1$ over the domain D = Do
with respect to the bases $\Phi$ and $\Upsilon$, where $\mathcal{F}$ is the hyperbolic cross index set. The
coefficients are sorted from largest in absolute value to smallest. In the top row, the multi-indices
in $\mathcal{F}$ are sorted lexicographically. In the bottom row, they are sorted according to increasing total
degree (i.e. the value $\iota_1 + \ldots + \iota_d$ for $\iota = (\iota_k)_{k=1}^d$)
Fig. 10 The same as in Fig. 8 except for $f = f_2$ and $D = D_1$


For this reason, in our subsequent experiments, we employ the total degree 
ordering. Naturally, this discussion leads to the question of the optimal ordering. 
We anticipate this to be function dependent, and it is outside the scope of this work 
to discuss it further. In practice, we expect a good ordering could be estimated from 
a set of candidate orderings via cross validation. 
5.3 Numerical Examples


In Figs. 11 and 12, we compare the orthogonalization strategy $\Upsilon$ against the original
Legendre basis restricted to the irregular domain $D$ (labelled $\Phi$). Several effects are
notable. First, orthogonalizing the basis generally leads to better performance than 
using the original Legendre basis. This is consistent with the observation that the 
various sample complexity bounds depend on the Riesz basis constants a,b > 0, 
which, as noted and as shown numerically in these figures, can behave wildly for 
irregular domains. On the other hand, it is clear that these constants are extremely 
pessimistic when it comes to predicting the actual performance. Even when the
constant $a$ is exceedingly small, the approximation based on the Legendre
basis still offers a reasonable error in most cases. Furthermore, as seen in Figs. 8
and 9, the orthogonalization strategy leads to slightly improved sparsity. Therefore, 
it is unclear what property of orthogonalization is driving the better approximation, 
whether it be the smaller Riesz basis constants or the improvement in sparsity. On 
the other hand, in Fig. 13, we present an example where orthogonalization worsens
the approximation. We expect this is due to the worse sparsity in the $\Upsilon$ basis, as
shown in Fig. 10.

Second, we observe that the ‘optimal’ sampling procedures generally outperform 
Monte Carlo sampling in lower dimensions, while this improvement lessens in 
higher dimensions or may actually lead to slightly worse performance. This is 
consistent with the observation made previously in Sect. 4.6. We also report the
values of the constants $\theta$ and $\Theta$ in all cases. It is notable that, in the case of
the orthogonalized basis, the corresponding constant $\Theta = \Theta_Q$ is much larger
than $\theta = \theta_Q$, even in high dimensions. However, this is not reflected in the
approximation errors for the two sampling strategies, which are similar, thus 
suggesting a gap between the theoretical guarantees and performance on actual 
function approximation problems. 


6 Structured Sparse Approximation 


Our main assumption throughout this chapter has been that $f$ admits an approximately
sparse representation in a dictionary $\Phi = \{\phi_\iota : \iota \in \mathcal{F}\}$. We conclude this
chapter with a brief discussion on several types of structured sparsity models. Our
focus throughout is on the setting of Problem 2. Hence, as in Sect. 4, we assume
that $\Phi$ is a finite set of $n$ linearly independent functions.
Fig. 11 The relative error (35) versus $m$ for $\ell^1$-minimization in the case of Example 3, where
$\mathcal{F}$ is the hyperbolic cross index set (12) and $f = f_1$ and D = Dp are as in (33) and
(34), respectively. This figure compares the Legendre basis on $[-1, 1]^d$ restricted to $D$ and the
orthonormal basis on $D$ constructed via (68). In the former case, the sampling strategies are Monte
Carlo sampling from the continuous uniform measure on $D$ (labelled ‘LU’) and sampling from the
discrete ‘optimal’ measure (53) (‘LO’). In the latter case, the sampling strategies are Monte Carlo
sampling (69) from the discrete uniform measure (‘QU’) and sampling from the discrete ‘optimal’
measure (70) (‘QO’). We also report the values of the corresponding constants $\theta^2$ and $\Theta^2$ for both
bases (labelled ‘L’ and ‘Q’, respectively), as well as the Riesz basis constants $a$ and $b$ with respect
to the discrete measure $\tau$
Fig. 12 The same as in Fig. 11 except for $D = D_3$


6.1 Weighted Sparsity and Weighted $\ell^1$-Minimization


Let $v = (v_\iota)_{\iota \in \mathcal{F}}$ be a vector of positive weights. For a set $S \subseteq \mathcal{F}$, we define its
weighted cardinality as

$$|S|_v = \sum_{\iota \in S} v_\iota^2.$$
Fig. 13 The same as in Fig. 11 except for $D = D_1$ and $f = f_2$


In the weighted sparsity model, given weights $v$ and a weighted sparsity $k > 0$, we
assume that a function $f \in L^2_\varrho(D; V)$ has a sparse representation $f_S$ of the form (5)
for some set $S \subseteq \mathcal{F}$ with $|S|_v \le k$. Note that, unlike the case of standard sparsity,
the weighted sparsity parameter can take any positive value in this setting, hence
our reason for using the notation $k$ instead of $s$.
Fortunately, promoting weighted sparsity structure is straightforward. Rather
than $\ell^1$-minimization, i.e. (39), (40) or (41), we consider a weighted $\ell^1$-minimization
problem. For example, we may replace (41) by

$$\hat{f} \in \operatorname*{argmin}_{p \in \mathcal{P}_{\mathcal{F};V_h}} \left\{ \lambda \|c\|_{\ell^1_v(\mathcal{F};V)} + \sqrt{ \frac{1}{m} \sum_{i=1}^m w(y_i) \| f(y_i) + n_i - p(y_i) \|_V^2 } \right\}. \qquad (71)$$

Here, $\|c\|_{\ell^1_v(\mathcal{F};V)} = \sum_{\iota \in \mathcal{F}} v_\iota \|c_\iota\|_V$ is the weighted $\ell^1$-norm of a Hilbert-valued
vector $c = (c_\iota)_{\iota \in \mathcal{F}}$.
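The two weighted quantities introduced in this subsection are simple to compute. The Python sketch below is a minimal illustration, assuming $V = \mathbb{R}^d$ with the Euclidean norm and storing weights and coefficients in dictionaries keyed by multi-index. The function names and the example weights are hypothetical.

```python
import math

def weighted_cardinality(S, v):
    """|S|_v: sum of the squared weights v_iota over the multi-indices iota in S."""
    return sum(v[iota] ** 2 for iota in S)

def weighted_l1_norm(c, v):
    """Weighted l^1 norm: sum of v_iota * ||c_iota||_V, assuming V = R^d (Euclidean norm)."""
    return sum(v[iota] * math.sqrt(sum(x * x for x in c[iota])) for iota in c)

# Hypothetical one-dimensional example with weights sqrt(2 * iota + 1).
v = {(0,): 1.0, (1,): math.sqrt(3.0), (2,): math.sqrt(5.0)}
c = {(0,): [3.0, 4.0], (1,): [0.0, 0.0], (2,): [1.0, 0.0]}
card = weighted_cardinality([(0,), (1,), (2,)], v)
norm = weighted_l1_norm(c, v)
```

With these weights the three-element set has weighted cardinality $1 + 3 + 5 = 9$, which illustrates why the weighted sparsity $k$ is generally not an integer count of nonzero terms.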


Remark 11 Much like the lower set assumption (see Sect. 3.3), weighted sparsity is
a natural assumption to consider when one expects the most significant coefficients
of $f$ to correspond to lower order terms. This is typically the case for smooth
function approximation using algebraic or trigonometric polynomials. We discuss
the relation between the two models later. As we see later, incorporating slowly
growing weights can lead to a significant improvement over unweighted $\ell^1$-minimization.
Note that weighted sparsity and weighted $\ell^1$-minimization were first
elaborated in [80], before further developments in [1, 2, 8, 25]. Other works on
incorporating weights into sparse polynomial approximation include [3, 75, 99].


As in the case of standard sparsity, successful recovery via (71) follows from a
norm equivalence similar to (44): namely,

$$\alpha \|p\|^2_{L^2_\varrho(D)} \le \frac{1}{m} \sum_{i=1}^m w(y_i) |p(y_i)|^2 \le \beta \|p\|^2_{L^2_\varrho(D)}, \quad \forall p \in \mathcal{P}_T,\ T \subseteq \mathcal{F},\ |T|_v \le t. \qquad (72)$$

Under this condition, one obtains an error bound identical to Theorem 5, except with
$\sigma_s(c)_{\ell^1(\mathcal{F};V)}$ replaced by the weighted term

$$\sigma_{k,v}(x)_{\ell^1_v(\mathcal{F};V)} = \inf \left\{ \|x - z\|_{\ell^1_v(\mathcal{F};V)} : z \in V^n \text{ is weighted } (k, v)\text{-sparse} \right\}, \quad x \in V^n,$$

and $s$ replaced by $k$. For the sake of succinctness, we omit the details and refer to
[8, 80] (the results therein are given in the scalar-valued case but readily extend to
the Hilbert-valued case).

Since our primary focus is on the question of sample complexity, we now state a
variant of Theorem 6 for (72) for the weighted sparse model:


Theorem 8 (Sample Complexity for (72)) Let $\Phi = \{\phi_\iota : \iota \in \mathcal{F}\} \subset L^2_\varrho(D)$ be a
finite dictionary consisting of $n$ linearly independent functions, with bounds $a, b > 0$
as in (45). Let $\mu$ be a probability measure satisfying Assumption 1, $v = (v_\iota)_{\iota \in \mathcal{F}}$
be weights with

$$v_\iota \ge \|\sqrt{w(\cdot)} \phi_\iota(\cdot)\|_{L^\infty_\varrho(D)}, \quad \forall \iota \in \mathcal{F}, \qquad (73)$$

where $w$ is as in (38), $1 \le t \le n$, $0 < \delta < \delta^*$ for some universal constant $0 < \delta^* < 1$,
$0 < \epsilon < 1$, $0 < \alpha \le \beta < \infty$, and $y_1, \ldots, y_m$ be independent with $y_i \sim \mu$ for
$i = 1, \ldots, m$. Suppose that

$$m \ge C \cdot \delta^{-2} \cdot a^{-1} \cdot t \cdot \left( \log(\mathrm{e} n) \cdot \log^2(\mathrm{e} a^{-1} t / \delta) + \log(2/\epsilon) \right)$$

for some universal constant $C > 0$. Then, (72) holds with $1 - \delta \le \alpha \le \beta \le 1 + \delta$,
with probability at least $1 - \epsilon$.


We omit the proof of this result, since it is similar to that of Theorem 6, the main
difference being the use of [20, Thm. 2.13] instead of [20, Thm. 1.1].


6.2 Sparsity in Lower Sets 


The lower set sparsity model differs from the weighted sparsity model in that it
imposes a lower set structure, as opposed to a weighted sparsity structure. But in
practice it can also be effected via weighted $\ell^1$-minimization with specific choices
of weights. In the lower set sparsity model, we suppose that $f \in L^2_\varrho(D; V)$ has a
sparse representation $f_S$ of the form (5) for some set $S \subseteq \mathcal{F}$ with $|S| \le s$ that
is also lower. In terms of sufficient conditions for lower set recovery, one’s first
thought may be to consider a variant of (44) with the additional assumption that the
sets $T$ be lower. Unfortunately, it is not known whether such an approach can work.
The difficulty lies with the fact that it is unclear how to promote lower set structure
directly via a convex penalty term such as the $\ell^1$-norm [8].

Instead, the approach originally proposed in [2, 25] is to use weighted sparsity as
a surrogate for lower set sparsity. This is done by choosing weights $u = (u_\iota)_{\iota \in \mathcal{F}}$ as
small as possible so that Theorem 8 applies, namely,

$$u_\iota = \|\sqrt{w(\cdot)} \phi_\iota(\cdot)\|_{L^\infty_\varrho(D)}, \quad \forall \iota \in \mathcal{F},$$

and defining the weighted sparsity as

$$k = k(s; w) = \max \{ |S|_u : |S| \le s,\ S \text{ lower} \}.$$

Note that this ensures that every lower set of size at most $s$ has weighted cardinality at most
$k$, i.e.

$$\{S : |S| \le s,\ S \text{ lower}\} \subseteq \{S : |S|_u \le k(s; w)\}.$$

As a result, we can promote lower set sparsity by solving the weighted $\ell^1$-minimization
problem (71) with weights $v = u$.




Remark 12 (The Choice of $\mathcal{F}$) Working with lower sets also yields a strategy for
choosing the large, truncated set $\mathcal{F}$ (recall the discussion at the beginning of Sect. 4)
[8, 25]. Indeed, it is a straightforward exercise to show that the union of all lower
sets of size at most $s$ is the hyperbolic cross index set. Hence, the target lower set $S$ in
the sparse representation (5) is guaranteed to lie within this index set, thus giving a
clear rationale for choosing this set.
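The exercise mentioned in Remark 12 can be checked by brute force in two dimensions. The Python sketch below is an illustration under assumptions: it takes the hyperbolic cross to be the set of multi-indices with $(\iota_1 + 1)(\iota_2 + 1) \le s$, and it uses the fact that lower sets in $\mathbb{N}_0^2$ correspond to integer partitions (the non-increasing row lengths of a Young diagram). The function names are hypothetical.

```python
def partitions(k, largest=None):
    """Yield the partitions of k as non-increasing tuples of positive integers."""
    if largest is None:
        largest = k
    if k == 0:
        yield ()
        return
    for first in range(min(k, largest), 0, -1):
        for rest in partitions(k - first, first):
            yield (first,) + rest

def union_of_lower_sets_2d(s):
    """Union of all lower sets in N_0^2 with at most s elements."""
    union = set()
    for k in range(1, s + 1):
        for rows in partitions(k):
            # Row j of the lower set contains the indices (0, j), ..., (rows[j] - 1, j).
            union.update((i, j) for j, length in enumerate(rows) for i in range(length))
    return union

def hyperbolic_cross_2d(s):
    """Multi-indices (i, j) with (i + 1) * (j + 1) <= s (assumed form of the index set)."""
    return {(i, j) for i in range(s) for j in range(s) if (i + 1) * (j + 1) <= s}
```

The two sets coincide because an index belongs to some lower set of size at most $s$ exactly when the rectangle of indices below it, which has $(\iota_1 + 1)(\iota_2 + 1)$ elements, fits within that budget.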


6.3 Sampling and Numerical Experiments 


We now discuss the matter of sampling. Notice that, unlike in the case of Theorem 6,
the sample complexity bound in Theorem 8 does not involve a constant $\Gamma$ depending
on the basis $\Phi$ and weight function $w$. This dependence only arises in the minimum
size condition (73) on the weights $v$. For Monte Carlo sampling ($w = 1$), this
condition may be quite stringent if the $L^\infty$-norms of the basis functions grow
rapidly. Hence, this condition suggests choosing $w$ to minimize the right-hand side
of (73). This leads to the same choice (51) and (53) as in the standard sparsity setting
considered previously.

The case of lower set sparsity allows for a more concrete discussion. Choosing
weights $v = u$ as discussed above, and invoking Theorem 8, leads to a measurement
condition of the form

$$m \gtrsim a^{-1} \cdot k(s; w) \cdot \left( \log(\mathrm{e} n) \cdot \log^2(\mathrm{e} a^{-1} k(s; w) / \delta) + \log(2/\epsilon) \right),$$

for recovering functions with sparse representations in lower sets. Hence, the
objective is to minimize $k(s; w)$ with respect to $w$. In the case of Monte Carlo
sampling, we have

$$k(s; w) = k(s; 1) = \max \{ |S|_u : |S| \le s,\ S \text{ lower} \}.$$

Consider, for illustration, Example 2. In this case, since the Legendre polynomials
all attain their maximum at the same point $y = (1, \ldots, 1)^\top$, we have

$$|S|_u = (\mathcal{N}(\mathcal{P}_S))^2,$$

where $\mathcal{N}(\mathcal{P}_S)$ is the unweighted Nikolskii constant (24). Recall from the discussion
in Sect. 3.3 that $\mathcal{N}(\mathcal{P}_S)$ satisfies the sharp bound $(\mathcal{N}(\mathcal{P}_S))^2 \le s^2$ for lower sets. In
other words, $k(s; 1) = s^2$ in this case, leading to a sample complexity bound of the
form

$$m \gtrsim s^2 \cdot \left( \log(\mathrm{e} n) \cdot \log^2(\mathrm{e} s) + \log(2/\epsilon) \right).$$

That is, the worst-case sample complexity for lower set recovery via Monte
Carlo sampling is the same (up to constants and log factors) as that of least squares
in the setting of Problem 1. See [8] for further discussion.
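The identity $k(s; 1) = s^2$ for the Legendre case can be verified numerically in two dimensions for small $s$. The sketch below is illustrative only. It assumes the tensor Legendre polynomials are orthonormal with respect to the uniform probability measure on $[-1, 1]^2$, so that the squared sup-norm of the polynomial with multi-index $\iota$ is $\prod_k (2 \iota_k + 1)$, and it enumerates lower sets through integer partitions. The function names are hypothetical.

```python
def partitions(k, largest=None):
    """Yield the partitions of k as non-increasing tuples of positive integers."""
    if largest is None:
        largest = k
    if k == 0:
        yield ()
        return
    for first in range(min(k, largest), 0, -1):
        for rest in partitions(k - first, first):
            yield (first,) + rest

def lower_sets_2d(s):
    """Yield every nonempty lower set in N_0^2 with at most s elements."""
    for k in range(1, s + 1):
        for rows in partitions(k):
            yield [(i, j) for j, length in enumerate(rows) for i in range(length)]

def legendre_weight_sq(iota):
    """Squared sup-norm of the orthonormal tensor Legendre polynomial: prod(2 * iota_k + 1)."""
    out = 1
    for k in iota:
        out *= 2 * k + 1
    return out

def k_monte_carlo(s):
    """k(s; 1): the largest weighted cardinality |S|_u over lower sets S with |S| <= s (d = 2)."""
    return max(sum(legendre_weight_sq(iota) for iota in S) for S in lower_sets_2d(s))
```

The maximum is attained, for instance, by the one-dimensional set $\{(0, 0), \ldots, (s - 1, 0)\}$, whose weighted cardinality is $1 + 3 + \ldots + (2s - 1) = s^2$. The enumeration grows quickly with $s$ and is only meant for small cases.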

Having considered Monte Carlo sampling for lower set recovery, one may also
consider how to choose the weight function $w$ and the corresponding sampling
measure $\mu$ to improve the sample complexity. The best solution in this case involves
choosing $w$ to minimize

$$k(s; w) = \max \left\{ \sum_{\iota \in S} \|\sqrt{w(\cdot)} \phi_\iota(\cdot)\|^2_{L^\infty_\varrho(D)} : |S| \le s,\ S \text{ lower} \right\},$$

over all strictly positive and finite almost everywhere weight functions on supp($\varrho$)
for which (7) holds. Unfortunately, even after resorting to a discrete measure as
in Sect. 4.5, it is unclear how to compute such a $w$, since it seemingly involves
enumerating all lower sets. As shown in [34], there are many lower sets in high
dimensions (for example, at least $\binom{d}{s-1}$ when $s \le d + 1$).

Since the optimal choice of $w$ (in the sense of minimizing $k(s; w)$) may not be
available, it is natural to consider how one might choose a good $w$. One option
involves the choice (51). This leads to the bound $k(s; w) \le \theta^2 s$. This has the benefit
of scaling linearly in $s$. But, it gives a sample complexity bound that is no better than
the case of standard sparse recovery studied previously. Once more, the question of
whether one can choose $w$ in such a way to ensure optimal recovery (scaling linearly
in $s$ and at most logarithmically in $d$ and $n$) is currently unresolved.

We conclude with a number of numerical experiments. In Figs. 14, 15, 16, we
consider Example 3 and employ the orthogonalization strategy of Sect. 5. We
compare unweighted and weighted $\ell^1$-minimization, where in the latter we set the
weights $u = (u_\iota)_{\iota \in \mathcal{F}}$ to be


ur, = Illy); (74) 


where $\Upsilon$ is the basis constructed via the approach of Sect. 5. In other words, these
weights follow the approach discussed in Sect. 6.2 for promoting lower set sparsity. 

For sampling, we consider Monte Carlo sampling (69) and the discrete ‘optimal’
measure (70). In all examples, we see that weighted $\ell^1$-minimization substantially
outperforms unweighted $\ell^1$-minimization. This is consistent with the observation
that the larger polynomial coefficients tend to occur at smaller multi-indices—a 
property that the weights (74) promote by assigning larger weights to higher multi- 
indices. In terms of sampling, we observe the sampling measure (70) outperforming 
Monte Carlo sampling (69), where, as per usual, the benefit tends to lessen in 
higher dimensions. As discussed above, we do not claim that (70) is an optimal 
sampling measure in the weighted case: in fact, unlike in the unweighted case, 
it does not necessarily minimize the corresponding term k(s; w) in the sample 
complexity bound. Yet, these experiments appear to suggest that it is a useful 
Fig. 14 The relative error (35) versus $m$ for $\ell^1$-minimization and weighted $\ell^1$-minimization in
the case of Example 3, where $\mathcal{F}$ is the hyperbolic cross index set (12) and $f = f_1$
and D = Dy are as in (33) and (34), respectively. This figure compares Monte Carlo sampling
(69) from the discrete uniform measure (labelled ‘QU’) and sampling from the discrete ‘optimal’
measure (70) (‘QO’)


strategy when combined with weights to further enhance recovery of smooth 
functions via polynomials. 


7 Conclusions and Challenges 


The purpose of this chapter has been to explore the question of optimal sampling 
for learning sparse approximations in high dimensions. In the more straightforward 
setting of Problem 1, we showed how this can be almost entirely resolved by 
defining a sampling measure (or measures) in terms of the Christoffel function of the 
corresponding subspace. We remark in passing on recent work that strives to go even
further, by removing the log factor in the sample complexity bound; see [29] and 
Fig. 15 The same as Fig. 14, except for $D = D_3$


the references therein. We note, however, that such procedures may not be feasible 
in practice or may not guarantee quasi-optimal error bounds. 

In the more challenging setting of Problem 2, we explored the limitations of 
Monte Carlo sampling and showed how to obtain a sampling measure that optimizes
the sufficient condition on the number of measurements. Empirically, this leads to
improved approximation, especially in lower dimensional problems. Finally, we 
discussed structured sparsity, via either weighted or lower set sparsity, both of which 
can be promoted by using weights. Although here we were not even able to find a 
sampling measure to optimize the sample complexity bound, we found empirically 
that the same sampling measure used previously worked well in practice when 
combined with weights. 

The major open problem raised by this work is therefore: is it possible to design 
sampling measures for sparse approximation in dictionaries that are theoretically 
optimal, with sample complexity bounds that scale log-linearly in s and logarithmi- 
cally in n? Currently, we have no answer to this question. We note in passing that it 
may be important to take into account more refined structure of the dictionary. See
[90] for recent work that uses the envelope bound (59) to derive improved sample 
Fig. 16 The same as Fig. 14, except for $f = f_2$ and $D = D_1$


complexity bounds in the case of Example 2 with Monte Carlo sampling. It is also
notable that the various constants $\theta$, $\Theta$ and $a$, $b$ (in the case of irregular domains)
often explain the observed performance very poorly. This is particularly notable in
the case of $a$, $b$. This raises the question of a more refined analysis that avoids these
terms.

Let us also mention several extensions. First, while this work has focused on 
standard dictionaries consisting of algebraic or trigonometric polynomials, it is 
perfectly applicable to much more general dictionaries. This includes dictionaries 
now arising commonly in machine learning settings, such as random feature 
models or learned dictionaries obtained from deep neural network training. We note 
recent work on learning sparse representations in random feature models [54]. An 
interesting question for future work involves applying the techniques considered 
herein to these models, to obtain better sampling strategies for such dictionaries. 
This may be highly relevant for applications using machine learning techniques 
that are data-starved. There is also the problem of combining sampling, via the 
strategies discussed herein, with learning the dictionary in an adaptive way to boost 
performance. 
Second, we note that the sampling model explored in this work is simple
pointwise evaluations. It is possible to extend this to much more general sampling
models, many of which occur in practical settings. An example is the discrete-in-
space-continuous-in-time model, which can occur when sensors in physical space
take continuous recordings of a time-dependent function $f(y, t)$. Another problem,
which arises commonly in uncertainty quantification (see [7, 45, 76], and the
references therein), is the problem where one measures both the function $f(y)$ and
its gradient $\nabla f(y)$ simultaneously at a sample point $y$. We anticipate that many of
the key results of this work can be extended to substantially more general sampling
models.
a stationary point or a local minimum. For this setting, we first present known
results for the convergence rates of deterministic first-order methods, which are
then followed by a general theoretical analysis of optimal stochastic and randomized
gradient schemes, and an overview of the stochastic first-order methods. After that,
we discuss quite general classes of non-convex problems, such as minimization
of $\alpha$-weakly quasi-convex functions and functions that satisfy the Polyak–Łojasiewicz
condition, which still allow obtaining theoretical convergence guarantees of first-order
methods. Then we consider higher-order and zeroth-order/derivative-free
methods and their convergence rates for non-convex optimization problems.


1 Introduction 


In this survey, we consider non-convex optimization problems in different settings, 
including stochastic optimization. We are mainly motivated by an increased interest 
in such problems in connection to applications in machine learning and data 
analysis, and our main focus is on the methods that possess theoretical guarantees 
for their global convergence rate or complexity. As we explain first by providing 
classical examples [181, 187], there is no hope to have any theoretical guarantees 
for finding a global minimizer in a general non-convex optimization problem 
in a reasonable time. Despite the quite good practical performance of classical 
general-purpose methods such as L-BFGS [97, 196], and proven local superlinear 
convergence, their global complexity is not well understood. 

In the last 20 years, theoretical analysis of the global convergence rate or global 
complexity guarantees has become de facto a standard in the area of numerical 
optimization. Since the convexity of the problem allows for such an analysis, 
many global complexity and convergence results have been obtained in convex 
optimization [22, 41, 86, 88, 156, 187]. Recent advances in machine learning,
which were made possible by the application of neural networks, have led the
optimization community to shift its focus to non-convex optimization and, especially,
to stochastic non-convex optimization. In this non-exhaustive survey, we attempt to
highlight the existing results on global performance guarantees of large-scale non-convex
optimization methods. The large dimension of the decision variable in such
problems motivates the use of first-order methods, which possess a cheap iteration.
Moreover, the large amount of data motivates the use of randomized methods such
as stochastic gradient descent, which does not require looking through the whole
dataset to make one step of the optimization procedure, thus making the iteration
even cheaper.

Since, in general, non-convex optimization problems cannot be solved efficiently,
we consider several ways to relax this challenging goal. The first relaxation
consists of finding problems with hidden convexity or a convex reformulation
of the problem. This requires exploiting the problem structure as much as
possible, which limits the generality of the approach but makes it possible
to find a global solution. Another way is to change the goal from finding the
global solution to finding a stationary point or a local extremum. In this case,
it is possible to obtain polynomial dependence of the complexity of first-order
methods on the dimension of the problem and the desired accuracy. We consider this
approach in the setting of deterministic and stochastic optimization. The third way
is to define a class of non-convex problems that is, on the one hand, quite
general and, on the other hand, allows one to obtain global performance guarantees for
an algorithm. We consider the class of problems with objective satisfying the
Polyak-Łojasiewicz condition, which leads to a global linear convergence rate, and the class
of problems with w-weakly quasi-convex objective, which leads to a global sublinear
convergence rate. In the above two approaches, we first focus on first-order methods.
Then, motivated by several settings in machine learning, such as reinforcement
learning, black-box adversarial attacks on neural networks, and simulation
optimization, in which the gradient of the objective is not available, we consider
zeroth-order or derivative-free methods and their convergence rates for non-convex
optimization problems. We by no means claim that our survey contains all the
important results in this area, since the literature is huge and we may have missed some
recent results. We would like to list here some other books [66, 98, 156, 204] and
surveys [57, 62, 67, 133, 237, 259, 278] related to our paper.¹


2 Preliminaries 


The main challenges in non-convex optimization are caused either by non-convexity
of the feasible set or by non-convexity of the objective function. The first case
is tightly connected with discrete optimization, where the decision variable can
take only a discrete set of values. In the second case, although the variable can take a
continuum of values, the non-convexity of the problem leaves no
hope of finding a global solution in a reasonable amount of time. We start with two
particular examples that illustrate the intractability of non-convex optimization in
general. This intractability motivates different kinds of relaxations, such as changing
the goal to finding an approximate stationary point instead of a
global minimum, introducing additional assumptions on the problem, or heavily
exploiting the structure of the problem, which leads to provable convergence to the global
minimizer. Next, we present general non-convex optimization problems and some
ways to classify them.


¹ See also the webpage https://sunju.org/research/nonconvex/, which maintains an updated list of references.


82 M. Danilova et al. 
2.1 Global Optimization Is NP-Hard 


Following [181], we consider an example that illustrates that the problem of finding 
the exact global solution of a non-convex problem is NP-hard. To that end, we 
consider the minimization problem 


\[
\min_{x \in \mathbb{R}^n} \left\{ f(x) = \sum_{i=1}^{n} x_i^4 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i^2\Big)^2 + \Big(\sum_{i=1}^{n} a_i x_i\Big)^4 + (1 - x_1)^4 \right\},
\]

where $x_i$ is the $i$-th component of the vector $x$. Let $A = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, where $I$ is the
identity matrix of size $n$ and $\mathbf{1}$ is the vector of $n$ ones, and let $[x]^2$ denote the vector with
components $([x]^2)_i = x_i^2$. In this notation, the objective takes the form

\[
f(x) = \langle A[x]^2, [x]^2 \rangle + \Big(\sum_{i=1}^{n} a_i x_i\Big)^4 + (1 - x_1)^4.
\]

Since $A$ is a positive semidefinite matrix, $f(x) \ge 0$. One may also note that 0 is
an eigenvalue of $A$ with multiplicity 1 and that $\mathbf{1}$ is the corresponding eigenvector.
With this in mind, it is not difficult to see that $f(x) = 0$ if and only if $x$ satisfies

\[
a_1 + \sum_{i=2}^{n} a_i x_i = 0, \quad x_i = \pm 1, \quad i = 2, \ldots, n.
\]


The problem of checking whether this equation has a solution is a form of the subset 
sum problem, which is known to be NP-complete. Since this problem has a solution 
if and only if the global minimum in the original optimization problem is exactly 
zero, this implies that the problem of finding even the value of a global minimum 
for a non-convex objective is NP-hard. 
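
The reduction can be checked numerically on tiny instances. The sketch below is only an illustration: it assumes the objective has the form $\sum_i x_i^4 - \frac{1}{n}(\sum_i x_i^2)^2 + (\sum_i a_i x_i)^4 + (1 - x_1)^4$, and the function names and the integer data are hypothetical. It enumerates all sign vectors with $x_1 = 1$ and reports those at which the objective vanishes.

```python
from itertools import product

def subset_sum_objective(x, a):
    """f(x) = sum x_i^4 - (1/n)(sum x_i^2)^2 + (sum a_i x_i)^4 + (1 - x_1)^4."""
    n = len(x)
    quartic = sum(xi ** 4 for xi in x)
    squares = sum(xi ** 2 for xi in x)
    linear = sum(ai * xi for ai, xi in zip(a, x))
    return quartic - squares ** 2 / n + linear ** 4 + (1 - x[0]) ** 4

def zero_sign_vectors(a):
    """All x with x_1 = 1 and x_i = +-1 (i >= 2) such that f(x) = 0, by brute force."""
    hits = []
    for tail in product((1, -1), repeat=len(a) - 1):
        x = (1,) + tail
        if subset_sum_objective(x, a) == 0:
            hits.append(x)
    return hits

# Illustrative data: 3 - 1 - 2 = 0 is solvable, while 3 +- 1 +- 1 = 0 is not.
solvable = zero_sign_vectors([3, 1, 2])
unsolvable = zero_sign_vectors([3, 1, 1])
```

The enumeration takes $2^{n-1}$ evaluations, which is exactly the exponential cost that the NP-hardness argument suggests cannot be avoided in general.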


2.2 Lower Complexity Bound for Global Optimization 


Following [187], we now derive a lower bound for the complexity of finding an 
approximate global minimum of a possibly non-convex objective. Consider the 
problem 


\[
\min_{x \in [0,1]^n} f(x),
\]

where $f$ is a possibly non-convex, Lipschitz-continuous function, i.e., for some
$M > 0$ and for all $x, y \in [0,1]^n$,

\[
|f(y) - f(x)| \le M \|y - x\|_\infty.
\]

Such a constant exists for every continuously differentiable function $f(x)$ on $[0,1]^n$, so this assumption
is not restrictive. Let us set the desired accuracy in terms of the objective as $\varepsilon$,
i.e., our goal is to find a point $\bar{x}$ such that $f(\bar{x}) - f^* \le \varepsilon$, where $f^*$ is the
global minimum of $f$ on $[0,1]^n$. For simplicity, we assume $\varepsilon$ to be equal to $1/N$
for some $N \in \mathbb{N}$. Consider a family of continuous non-convex objectives $f_k(x)$,
$k = 1, \ldots, (MN/2)^n$, constructed as follows: we divide the hypercube $[0,1]^n$ into
$(MN/2)^n$ non-intersecting hypercubes $C_k$ with side length $2/(MN)$ and set

\[
f_k(x) =
\begin{cases}
-M \operatorname{dist}_\infty(x, \partial C_k), & x \in C_k, \\
0, & x \notin C_k,
\end{cases}
\]

where $\partial C_k$ is the boundary of $C_k$ and $\operatorname{dist}_\infty(x, \partial C_k)$ is the distance between $x$ and
$\partial C_k$ in the $\|\cdot\|_\infty$-norm. Each $f_k$ has a minimum value of exactly $-\varepsilon$ attained at the
center of $C_k$, and the Lipschitz constant of $f_k$ is equal to $M$.

Any minimization method generating its trajectory based on the values of $f(x)$
and its derivatives at the points of the trajectory would need to sample a point from
each $C_k$ to find an approximate minimum of each $f_k(x)$. This gives us a lower bound
on the number of iterations required: $\Omega((MN)^n) = \Omega(M^n \varepsilon^{-n})$. Moreover, this bound is
attained by the algorithm that simply samples the objective values at the vertices of
a uniform grid and returns the point with the smallest value. This demonstrates that 
it is practically impossible to solve a high-dimensional non-convex minimization 
problem with any reasonable accuracy unless some additional assumptions are 
introduced. 
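
The exponential cost of uniform sampling is easy to observe in practice. The following sketch is an illustration rather than the construction from [187]: it samples the centers of $p^n$ equal cells instead of grid vertices, which guarantees an error of at most $M/(2p)$ for an objective that is $M$-Lipschitz in the $\|\cdot\|_\infty$-norm. The test objective and all names are hypothetical.

```python
from itertools import product
import math

def grid_search(f, n, p):
    """Evaluate f at the centers of p**n equal cells of [0,1]^n.

    Returns the best point, its value, and the number of evaluations."""
    axis = [(2 * i + 1) / (2 * p) for i in range(p)]
    best_x, best_val, evals = None, math.inf, 0
    for x in product(axis, repeat=n):
        val = f(x)
        evals += 1
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val, evals

def cells_per_axis(M, eps):
    """Smallest p with M / (2p) <= eps, so the grid value is eps-optimal."""
    return math.ceil(M / (2 * eps))

# Hypothetical objective: 1-Lipschitz in the infinity norm, minimum 0 at (0.37, ..., 0.37).
def f(x):
    return max(abs(xi - 0.37) for xi in x)
```

With $M = 1$ and $\varepsilon = 0.1$, five cells per axis suffice, so the search costs $5^n$ evaluations: 125 for $n = 3$, but already about $10^{14}$ for $n = 20$.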

A similar complexity bound is proved in [186] for finding a point $\bar{x}$ such that
$\|\nabla f(\bar{x})\|_\infty \le \varepsilon$ and $\|\bar{x}\|_\infty \le R$. More precisely, for non-convex functions with
Lipschitz-continuous Hessian, such that there exists at least one point $x^*$ with


V f (x*) = 0 and ||x*lloo < R, the lower complexity bound is 2 ((e/e)"”). 


2.3. Examples of Non-Convex Problems 


In this subsection, we give a non-exhaustive overview of non-convex problem
formulations and applications where they arise, with a focus on tractable problems.
One possible way to classify such non-convex problems is to divide them into
two groups:


• Problems with hidden convexity or analytic solutions
• Problems with provable global solution


Let us consider formulations of a few concrete problems in each of these classes. 


2.3.1 Problems with Hidden Convexity or Analytic Solutions 


First, it is worth noting a broad class of classical non-convex problems, which includes
linear-fractional programs, geometric programs, and problems with two quadratic
functions, as well as techniques such as handling convex equality constraints and
convexifying constraint sets. Many such
problems are equivalent to convex problems via a simple transformation such as
convex relaxation or duality [38].

Next, a wide range of tasks in machine learning and statistics reduces to eigenproblems.
Among these are principal component analysis,
classical multidimensional scaling, and other generalized eigenvalue problems [56].

In the context of non-convex optimization problems, one cannot but mention the
class of combinatorial optimization problems such as graph problems. Most of
these problems are NP-complete, but despite this, there are effective approaches
for solving them. Let us take a closer look at the MAX-CUT problem, which
is a striking example of a convex reformulation. In some problems, the goal is to find
a point with a value as small as possible (or as large as possible in the context of
maximization problems), but whether this point is close to the global minimizer is
not that important. In this case, we can try to approximate the problem with a simpler
one and show that the exact solution to the approximate problem corresponds to a
good solution of the original problem. We first illustrate this idea on the MAX-CUT
problem

\[
\max_{x \in \{-1,1\}^n} \left\{ f(x) = \frac{1}{2} \sum_{i,j=1}^{n,n} A_{ij} (x_i - x_j)^2 \right\},
\]

where $A = \|A_{ij}\|_{i,j=1}^{n,n}$, $A = A^\top$. This is a discrete optimization problem. If we
are interested only in the value of the functional and not in the cut itself, we can
approximate this problem with a computationally tractable one. Let us introduce the
matrix

\[
L = \operatorname{diag}\Big(\sum_{j=1}^{n} A_{ij}\Big)_{i=1}^{n} - A,
\]

which allows us to write

\[
f(x) = \langle x, Lx \rangle.
\]


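A simple observation: if $\zeta$ is a random vector uniformly distributed on the Hamming
cube $\{-1, 1\}^n$, then

\[
\mathbb{E}\langle \zeta, L\zeta \rangle \ge 0.5 \max_{x \in \{-1,1\}^n} \langle x, Lx \rangle.
\]

In fact, we can do better due to the construction of Goemans and Williamson [106]:

\[
\max_{x \in \{-1,1\}^n} \langle x, Lx \rangle = \max_{x \in \{-1,1\}^n} \langle L, xx^\top \rangle \le \max_{X \succeq 0,\; X_{ii} = 1} \langle L, X \rangle.
\]

This is an SDP problem. Let $\hat{X}$ be the solution of this SDP problem, and let
$\xi \sim \mathcal{N}(0, \hat{X})$, $\zeta = \operatorname{sign}(\xi)$.
Then

\[
\mathbb{E}\langle \zeta, L\zeta \rangle \ge \alpha_{GW} \max_{x \in \{-1,1\}^n} \langle x, Lx \rangle,
\]

where $\alpha_{GW} \approx 0.878567$, and this constant cannot be improved provided that $P \ne NP$
and the unique games conjecture is true [143].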
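
The quadratic-form representation and the simple 0.5 guarantee for uniformly random signs can be verified by exhaustive enumeration on a small graph. The sketch below assumes the normalization $f(x) = \frac{1}{2}\sum_{i,j} A_{ij}(x_i - x_j)^2$, for which the matrix $L = \operatorname{diag}(A\mathbf{1}) - A$ gives $f(x) = \langle x, Lx\rangle$; the 4-cycle data and the helper names are illustrative. It does not solve the SDP relaxation, which would require a dedicated solver.

```python
from itertools import product

def laplacian(A):
    """L = diag(A 1) - A for a symmetric weight matrix A given as nested lists."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]

def quad_form(L, x):
    """<x, L x>."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

def cut_objective(A, x):
    """f(x) = 1/2 * sum_{i,j} A_ij (x_i - x_j)^2 (assumed normalization)."""
    n = len(x)
    return 0.5 * sum(A[i][j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))

# Illustrative graph: a 4-cycle with unit weights.
A = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
L = laplacian(A)
signs = list(product((-1, 1), repeat=4))
values = [quad_form(L, x) for x in signs]
best = max(values)
average = sum(values) / len(values)  # exact expectation over uniform random signs
```

For this graph the average over all sign vectors equals the trace of $L$, and it is exactly half of the maximum, so the factor 0.5 in the simple observation is attained.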

Further, we would like to highlight the following subclasses of non-convex
problems: non-convex proximal operators (hard thresholding [32], Potts
minimization [146]), discrete problems (binary graph segmentation, discrete Potts
minimization, nearly optimal K-means), infinite-dimensional problems (smoothing
splines, locally adaptive regression splines, reproducing kernel Hilbert spaces), and
statistical problems.

Another important practical example we would like to mention in this part is
blind deconvolution. Convolutional models arise in a wide range of problems in
image processing and computer vision. The most basic convolutional data model,
blind deconvolution, aims to recover a convolution kernel $a_0 \in \mathbb{R}^k$ and a signal
$x_0 \in \mathbb{R}^m$ from their convolution

\[
y = a_0 \circledast x_0,
\]

where $y \in \mathbb{R}^m$ and $\circledast$ is some kind of convolution. This problem is ill-posed
in general: there are infinitely many pairs $(a_0, x_0)$ that convolve to produce $y$. To
overcome this issue, some low-dimensional priors on $a_0$ and $x_0$ are necessary.
As a result, it is essential to use additional constraints and regularization terms.
Different priors produce different non-convex optimization problems: sparse blind
deconvolution [205], multi-channel sparse blind deconvolution [224], subspace
blind deconvolution [162], and convolutional dictionary learning [199].

The observed data in many settings in science and engineering are admixtures of
several latent sources. Given the observations, we would normally wish to infer
the latent sources as well as the admixture distribution. Non-negative matrix
factorization (NMF) [100] offers a natural mathematical
framework for modeling numerous mixing problems. In NMF, each row of the observation
matrix $M \in \mathbb{R}^{m \times n}$ corresponds to a data point in $\mathbb{R}^n$. The following
assumptions are used: (1) there are $r$ latent sources, encoded by the unobserved
matrix $W \in \mathbb{R}^{r \times n}$, and (2) each observed data point can be written as a linear
combination of the $r$ sources, with the weights of the combination given by the matrix
$A \in \mathbb{R}^{m \times r}$. The goal is to find a representation $M = AW$ with
the entries of $M$, $A$, and $W$ being non-negative. The number $r$ is called the inner
dimension of the factorization, and the smallest possible $r$ is the non-negative rank
of $M$.

Finally, in the part devoted to problems with hidden convexity or analytic
solutions, we would like to discuss compressed sensing and $\ell_1$ optimization. A
vector is said to be $s$-sparse if it has at most $s$ non-zero elements. Consider solving
$Ax = b$ for $x$, where $A$ is an $n \times d$ matrix with $n < d$. The set of solutions to $Ax = b$
is an affine subspace. However, if we restrict ourselves to $s$-sparse solutions, then under certain
conditions on $A$ there is a unique sparse solution [30]. Indeed, suppose that
there were two $s$-sparse solutions $x_1$ and $x_2$. Then $x_1 - x_2$ would be a $2s$-sparse
solution to the homogeneous system $Ax = 0$, which would imply that some $2s$
columns of $A$ are linearly dependent. Unless $A$ has $2s$ linearly dependent columns,
there can only be one $s$-sparse solution.

There are many areas in which the problem is to find the unique sparse solution 
to a linear system. One is in plant breeding [30]. Assume we are given a number 
of apple trees and the strength of some desirable feature of each tree. If we wish to 
determine which genes are responsible for the feature, we may formulate a system 
of linear equations Ax = b in which each row of the matrix A corresponds to a tree 
and each column corresponds to a position on the genome. The vector b corresponds 
to the strength of the desired feature in each tree. The solution x tells us the positions 
on the genome corresponding to the genes that account for the feature. 

The problem of finding a sparse solution can be stated as the optimization
problem

\[
\min_{Ax = b} \|x\|_0,
\]

where $\|x\|_0$ is the number of non-zero coordinates of $x$. This is an NP-hard problem,
but it may sometimes be replaced by the convex problem

\[
\min_{Ax = b} \|x\|_1.
\]

What are the sufficient conditions for

\[
\min_{Ax = b} \|x\|_0 \;\Leftrightarrow\; \min_{Ax = b} \|x\|_1?
\]


A matrix $A$ is said to satisfy the $s$-restricted isometry property if there exists
$\delta_s \in (0, 1)$ such that for any $s$-sparse $x$

\[
(1 - \delta_s)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta_s)\|x\|_2^2.
\]


The following theorems give sufficient conditions for the equivalence mentioned 
above to hold [30, 43]. 


Theorem 1 Suppose $A$ satisfies the $s$-restricted isometry property with
$\delta_{s+1} \le \frac{1}{10\sqrt{s}}$. Suppose $x_0$ is $s$-sparse and satisfies $Ax_0 = b$. Then, $x_0$ is the unique minimum
1-norm solution to $Ax = b$.


Theorem 2 Suppose $A$ satisfies the $k$-restricted isometry property for $k \in
\{s, 2s, 3s\}$ with $\delta_s + \delta_{2s} + \delta_{3s} < 1$. Suppose $x_0$ is $s$-sparse and satisfies $Ax_0 = b$.
Then, $x_0$ is the unique minimum 1-norm solution to $Ax = b$.


Such results demonstrate the importance of matrices satisfying the restricted 
isometry property for practice. Fortunately, there is an easy way to obtain such 
matrices [19]. 


Theorem 3 Suppose $A$ is a $d \times n$ matrix with elements sampled from the Gaussian
distribution $\mathcal{N}(0, 1/d)$. Then, $A$ satisfies the $s$-restricted isometry property for $s <
d$ with $0 < \delta_s < 1$ with probability $p_s$ satisfying

\[
p_s \ge 1 - 2(12/\delta_s)^s \exp\left(-\frac{3\delta_s^2 - \delta_s^3}{48}\, d\right).
\]


2.3.2 Problems with Convergence Results 


In this section, we would like to give examples of non-convex optimization problems 
for which there are methods with proven convergence results. We start with the 
phase retrieval problem, which has been studied
since at least the early 1980s. It consists in recovering a function given the magnitude
of its Fourier transform. This problem arises in various engineering
and scientific applications such as optical imaging, electron microscopy, and
crystallography [222]. The goal is to recover a $d$-dimensional signal vector $x^* \in \mathbb{C}^d$ from
its phaseless measurements

\[
y_k = |\langle a_k, x^* \rangle|^2, \quad k = 1, \ldots, M,
\]

with $a_k$ denoting the measurement vectors. As a result, the phase retrieval problem
can be formulated as the following least squares problem or empirical risk
minimization:

\[
\min_{x \in \mathbb{C}^d} \sum_{k=1}^{M} \left( y_k - |\langle a_k, x \rangle|^2 \right)^2.
\]


This problem is well-motivated by practical concerns, but unfortunately, this is a 
non-convex problem, and it is not clear how to find a global minimum even if one 
exists. In recent literature, there are various approaches to handle this problem [61, 
241, 260]; also, algorithms with the provable convergence results were presented in 
the following papers [46, 268]. 
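
The source of non-convexity in phase retrieval is easy to exhibit numerically: the measurements are invariant under a global phase change of the signal, so the minimizers form a whole orbit, while the midpoint of two antipodal minimizers is not a minimizer. The sketch below uses a least-squares misfit without any normalization constant; the measurement vectors, the signal, and the function names are illustrative assumptions.

```python
import cmath

def measurements(vectors, x):
    """y_k = |<a_k, x>|^2 with <a, x> = sum conj(a_i) * x_i."""
    return [abs(sum(ai.conjugate() * xi for ai, xi in zip(a, x))) ** 2 for a in vectors]

def phase_loss(vectors, y, x):
    """Least-squares misfit sum_k (y_k - |<a_k, x>|^2)^2."""
    return sum((yk - mk) ** 2 for yk, mk in zip(y, measurements(vectors, x)))

# Illustrative measurement vectors and ground-truth signal in C^2.
a_vecs = [(1 + 0j, 0j), (0j, 1 + 0j), (1 + 0j, 1 + 0j), (1 + 0j, 1j)]
x_true = (1 + 2j, -1 + 0.5j)
y = measurements(a_vecs, x_true)

rotated = tuple(cmath.exp(0.7j) * xi for xi in x_true)   # same signal up to a global phase
midpoint = tuple((xi + (-xi)) / 2 for xi in x_true)      # average of minimizers x and -x
```

Both `x_true` and `rotated` give (numerically) zero loss, whereas `midpoint`, the zero vector, has a strictly positive loss; a convex function could not take a larger value at the midpoint of two of its minimizers.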

In the context of non-convex optimization problems with proven convergence
results, one cannot but mention low-rank matrix completion. Two related
problems, matrix completion and matrix sensing [28], arise in big data
problems with incomplete observations and in other machine learning problems. We would
like to draw attention to exact low-rank matrix completion. Suppose that a matrix
$Y \in \mathbb{R}^{n \times n}$ is partially observed over a set of indices $\Omega \subset \{1, \ldots, n\}^2$. Consider
the problem of finding the lowest-rank matrix matching $Y$ on the observed set:

\[
\min_{X \in \mathbb{R}^{n \times n}} \operatorname{rank}(X) \quad \text{s.t.} \quad X_{ij} = Y_{ij}, \; (i, j) \in \Omega.
\]

This is a non-convex problem having a natural convex relaxation

\[
\min_{X \in \mathbb{R}^{n \times n}} \|X\|_{\mathrm{tr}} \quad \text{s.t.} \quad X_{ij} = Y_{ij}, \; (i, j) \in \Omega,
\]

where $\|X\|_{\mathrm{tr}}$ is the trace (nuclear) norm, i.e., the sum of the singular values of $X$.


In the paper [134], the first results of global optimality of alternating minimization 
were obtained for matrix completion and the related problem of matrix sensing. 
Proofs of (nearly) linear convergence of gradient descent for phase retrieval, matrix 
completion, blind deconvolution can be found in the article [173]. Under some 
assumptions, it can be shown that the solution to the convex problem is exactly equal 
to the solution to the non-convex problem, with high probability over the sampling 
model [42, 44]. So, this problem can also be attributed to statistical problems with 
hidden convexity. Moreover, we emphasize another relevant problem called low- 
rank matrix recovery. This problem is also known to be non-convex but under 
some assumptions has no spurious local minima (see [272, 282] and references 
therein).

Deep Learning In the era of AI, training deep neural networks [107] is one of
the most popular optimization problems with an enormous number of applications,
e.g., [76, 115, 132, 144, 148, 149, 154, 210, 230, 238]. The simplest example of such a
problem [237] is training a fully connected neural network for a supervised learning
problem:

\[
\min_{\substack{W = (W_1, \ldots, W_L) \\ W_i \in \mathbb{R}^{n_i \times n_{i-1}},\; i = 1, \ldots, L}} \left\{ f(W) = \frac{1}{m} \sum_{i=1}^{m} \ell\big(y_i, f(x_i; W)\big) \right\},
\]

where $\{(x_i, y_i)\}_{i=1}^{m}$, $x_i \in \mathbb{R}^{n_0}$, $y_i \in \mathbb{R}^{n_L}$ are training data points, $W =
(W_1, \ldots, W_L)$ are the weights of the model, $L$ is the number of fully connected layers,
$\ell(\cdot, \cdot)$ is a loss function, e.g., quadratic loss or logistic loss, and

\[
f(x_i; W) = W_L \phi\big(W_{L-1} \phi(\ldots \phi(W_2 \phi(W_1 x_i)))\big),
\]

where $\phi$ is a scalar² function called an activation function.
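
To make the structure of this objective concrete, here is a minimal forward pass and empirical risk for a fully connected network, written with plain Python lists. It is only a sketch: the function names, the default `tanh` activation, and the quadratic loss are illustrative choices, and the activation is applied to every layer except the last one, as in the nested formula above.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def forward(weights, x, phi=math.tanh):
    """f(x; W) = W_L phi(W_{L-1} phi(... phi(W_1 x))), phi applied componentwise."""
    h = x
    for W in weights[:-1]:
        h = [phi(z) for z in matvec(W, h)]
    return matvec(weights[-1], h)

def quadratic_loss(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred))

def empirical_risk(weights, data, loss, phi=math.tanh):
    """(1/m) * sum_i loss(y_i, f(x_i; W)) over the training pairs (x_i, y_i)."""
    return sum(loss(y, forward(weights, x, phi)) for x, y in data) / len(data)
```

Even with an identity activation the risk is non-convex in $W$, because the prediction depends on the product $W_L \cdots W_1$ rather than on each factor linearly.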

In general, training neural networks is an NP-complete problem [31]. Deep neural
networks have bad local minima both for non-smooth activation functions [214, 240]
and for smooth ones [166, 269], as well as flat saddles [252]. Nevertheless, there
exist positive results about training neural networks. First of all, under different
assumptions, it was shown that all local minima are global for 1-layer neural
networks [95, 124, 233]. Next, one can show that, under some
assumptions, GD/SGD converges to a global minimum for linear networks [17, 135, 228] and sufficiently
wide over-parameterized networks [12]. A detailed summary of recent advances
in optimization for deep learning can be found in [237].


2.3.3 Geometry of Non-Convex Optimization Problems 


In one of the latest surveys [279], the authors single out a class of
tractable non-convex problems that have certain symmetry properties. They
highlight non-convex optimization problems with rotational symmetry and discrete
symmetry. Problems with rotational symmetry include the previously described
phase retrieval and related problems in low-rank matrix factorization and recovery.
It turns out that blind deconvolution and tensor decomposition problems have
discrete symmetry.


² By $\phi(a)$, where $a = (a_1, \ldots, a_n)^\top \in \mathbb{R}^n$ is a multidimensional vector, we mean the vector
$(\phi(a_1), \ldots, \phi(a_n))^\top$.


3 Deterministic First-Order Methods


In this section, we focus on the following optimization problem: 


\[
\min_{x \in Q \subseteq \mathbb{R}^n} f(x), \tag{1}
\]

where $Q$ is a simple closed convex set and $f$ is a continuously differentiable
function. The simplest method for this kind of problem is projected gradient
descent, which can be motivated by simple continuous-time dynamics. For
simplicity, we start with the unconstrained case $Q = \mathbb{R}^n$.


3.1 Unconstrained Minimization 


In the case $Q = \mathbb{R}^n$, the trajectory of the continuous-time gradient method is the
solution to the differential equation $\dot{x}(t) = -\nabla f(x(t))$. It is easy to see that $W(x) =
f(x(t))$ is a Lyapunov function for this dynamical system. Indeed,

\[
\frac{dW}{dt} = \langle \nabla f(x), \dot{x} \rangle = \langle \nabla f(x), -\nabla f(x) \rangle = -\|\nabla f(x)\|_2^2 \le 0.
\]

This implies the convergence of the continuous-time gradient descent method to a
stationary point.

The classic gradient descent method is then the Euler discretization of the above
dynamics and has the form [204]

\[
x^{k+1} = x^k - h_k \nabla f(x^k),
\]

where $h_k > 0$ is the stepsize of the method. One of the main assumptions in this
setting is that the function $f$ is $L$-smooth, or, equivalently, its gradient is
Lipschitz-continuous, i.e., for some starting point $x^0$,

\[
\forall x, y \in \{x \in \mathbb{R}^n : f(x) \le f(x^0)\} \quad \|\nabla f(y) - \nabla f(x)\|_2 \le L \|y - x\|_2.
\]

Then the stepsize $h_k = 1/L$ guarantees

\[
f(x^{k+1}) \le f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|_2^2.
\]

Summing up these inequalities for $k = 0, \ldots, N-1$, we obtain

\[
f(x^N) - f(x^0) \le -\frac{1}{2L} \sum_{k=0}^{N-1} \|\nabla f(x^k)\|_2^2 \le -\frac{N}{2L} \min_{k=0,\ldots,N-1} \|\nabla f(x^k)\|_2^2.
\]

Define $f_* = \inf_{x \in \mathbb{R}^n} f(x)$, and assume that this value is finite. Then

\[
\min_{k=0,\ldots,N-1} \|\nabla f(x^k)\|_2^2 \le \frac{2L(f(x^0) - f_*)}{N}. \tag{2}
\]

This proves that the complexity of finding an approximate stationary point, i.e., a
point $\bar{x}$ such that $\|\nabla f(\bar{x})\|_2 \le \varepsilon$, is $O\left(\frac{L(f(x^0) - f_*)}{\varepsilon^2}\right)$. This iteration complexity of
finding an $\varepsilon$-stationary point, $N \sim \varepsilon^{-2}$, cannot be improved in terms of its dependence on
$\varepsilon$ and $L$ for an arbitrary first-order method applied to minimization of an $L$-smooth
objective.
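
The guarantee (2) can be observed on a small non-convex example. The sketch below runs gradient descent with stepsize $1/L$ and compares the smallest squared gradient norm with the right-hand side of (2). The test function $f(x) = \sin(x_1) + \frac{1}{4}\cos(2x_2)$ is a hypothetical choice made for this illustration: its Hessian is diagonal with entries in $[-1, 1]$, so $L = 1$, and its infimum is $f_* = -1.25$.

```python
import math

def gradient_descent(grad, x0, L, num_steps):
    """Run x^{k+1} = x^k - (1/L) * grad(x^k); return the iterates x^0, ..., x^N."""
    xs = [list(x0)]
    for _ in range(num_steps):
        g = grad(xs[-1])
        xs.append([xi - gi / L for xi, gi in zip(xs[-1], g)])
    return xs

def f(x):
    return math.sin(x[0]) + 0.25 * math.cos(2 * x[1])

def grad_f(x):
    return [math.cos(x[0]), -0.5 * math.sin(2 * x[1])]

L, f_star, N = 1.0, -1.25, 50
xs = gradient_descent(grad_f, [1.0, 0.4], L, N)
min_sq_grad = min(sum(g * g for g in grad_f(x)) for x in xs[:N])  # over k = 0, ..., N-1
bound = 2 * L * (f(xs[0]) - f_star) / N                           # right-hand side of (2)
```

In this run the iterates approach a stationary point quickly, so `min_sq_grad` is far below `bound`; the bound describes the worst case over all $L$-smooth functions.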

On the one hand, this bound is much better than the bound for finding the global
minimum derived in Sect. 2.2, which is exponential in the dimension. On the
other hand, we can guarantee only an approximate stationary point, which could
be a saddle point or even a maximum. This can be illustrated by the example of
minimization of the following objective [187]:

\[
f(x_1, x_2) = \frac{1}{2}(x_1)^2 + \frac{1}{4}(x_2)^4 - \frac{1}{2}(x_2)^2.
\]

If we set $x^0 = (1, 0)^\top$, then $x^k$ converges to $(0, 0)^\top$ as $k \to \infty$, which is a saddle
point. The good news here is that gradient descent can be perturbed by adding some
noise to the iterates in such a way that it converges to a local minimum for almost
all initial points and escapes saddle points [137].
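
This behavior is easy to reproduce for the objective $f(x_1, x_2) = \frac{1}{2}x_1^2 + \frac{1}{4}x_2^4 - \frac{1}{2}x_2^2$, whose gradient is $(x_1,\; x_2^3 - x_2)$. The sketch below is a simplification and not the perturbed method of [137]: instead of injecting random noise into the iterates, it only shifts the starting point slightly off the axis $x_2 = 0$. The stepsize 0.5 and the function names are assumptions made for the illustration.

```python
def grad(x):
    """Gradient of f(x1, x2) = x1^2/2 + x2^4/4 - x2^2/2."""
    return (x[0], x[1] ** 3 - x[1])

def run_gd(x0, h=0.5, steps=200):
    """Plain gradient descent with a constant stepsize h."""
    x = tuple(x0)
    for _ in range(steps):
        g = grad(x)
        x = (x[0] - h * g[0], x[1] - h * g[1])
    return x

at_saddle = run_gd((1.0, 0.0))    # stays on the axis x2 = 0 and ends at the saddle (0, 0)
escaped = run_gd((1.0, 1e-3))     # a tiny offset in x2 leads to the local minimizer (0, 1)
```

Started exactly at $(1, 0)$, the second coordinate of the gradient is zero at every iterate, so the method cannot leave the axis; any non-zero offset in $x_2$ is amplified until the iterates reach a minimizer.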

It is important to note that, under additional smoothness assumptions that
higher-order derivatives of the objective are Lipschitz-continuous, i.e.,

\[
\forall x, y \in \{x \in \mathbb{R}^n : f(x) \le f(x^0)\} \quad \|\nabla^p f(y) - \nabla^p f(x)\| \le L_p \|y - x\|_2,
\]

the authors of [50, 51] obtain several lower complexity bounds for finding an approximate
stationary point. If this inequality holds for $p \in \{1, 2\}$, the lower bound becomes
$\varepsilon^{-12/7}$, and the additional assumption that the same holds for $p = 3$ gives the lower
bound $\varepsilon^{-8/5}$. Surprisingly, Lipschitz continuity of derivatives of order 4 and higher
gives the same lower complexity bound.


3.2. Incorporating Simple Constraints 


It is possible to generalize the gradient method to the setting of composite optimization
with simple convex constraints, i.e., to the problem

\[
\min_{x \in Q} \{F(x) = f(x) + \psi(x)\}, \tag{3}
\]

where $Q$ is a closed convex set, $\psi(x)$ is a simple convex function, e.g., $\|x\|_1$, and $f$
is an $L$-smooth function. The standard approach for such problems uses a prox-function
$d(x)$ that is continuously differentiable and strongly convex on $Q$, i.e.,
$d(y) - d(x) - \langle \nabla d(x), y - x \rangle \ge \frac{1}{2}\|y - x\|^2$ for any $x, y \in Q$. We also define the corresponding
Bregman divergence $V[z](x) = d(x) - d(z) - \langle \nabla d(z), x - z \rangle$, $x, z \in Q$. Then the
step of the gradient method from a point $x$ with stepsize $h$ is generalized [104, 187]
to

\[
x^+ = \arg\min_{u \in Q} \left\{ \langle \nabla f(x), u \rangle + \frac{1}{h} V[x](u) + \psi(u) \right\},
\]

which in the simplest case $\psi(x) = 0$, $d(x) = \frac{1}{2}\|x\|_2^2$, $V[z](x) = \frac{1}{2}\|x - z\|_2^2$,
$Q = \mathbb{R}^n$ coincides with the step of the gradient method. This generalized gradient
step leads to a generalized gradient, which is usually referred to as the gradient mapping
[104, 187]: $g_Q(x) = \frac{1}{h}(x - x^+)$. In this setting, the authors of [104] prove that

\[
\min_{k=0,\ldots,N-1} \|g_Q(x^k)\|^2 \le \frac{2L(F(x^0) - F_*)}{N}
\]

if $h = 1/L$. Here $F_*$ is a lower bound for $F(x)$. In the simple
situation described above, this bound coincides with the bound (2). The authors of [69] prove that if
$\|g_Q(x)\| \le \varepsilon$, then $x^+$ is an approximately stationary point of the problem. More
precisely, there exists $p \in \partial\psi(x^+)$ such that

\[
\nabla f(x^+) + p \in -N_Q(x^+) + B\big((1 + L(d))\varepsilon\big),
\]

where $N_Q(x^+)$ is the normal cone of $Q$ at the point $x^+$, $B(r) = \{v \in \mathbb{R}^n : \|v\|_* \le
r\}$ is the ball in the dual space defined by the conjugate norm, and it is assumed that $d$ is
$L(d)$-smooth. Note that there is no contradiction with the exponential lower bound
given at the end of Sect. 2.2 since the obtained point $x^+$ does not necessarily have a small
norm of the gradient.
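
For the Euclidean prox-function, $Q = \mathbb{R}^n$, and an $\ell_1$ term, the generalized gradient step has a closed form known as soft thresholding. The sketch below implements this special case together with the gradient mapping; the regularization weight `lam` and the function names are assumptions introduced for the illustration and do not come from [104, 187].

```python
def soft_threshold(v, t):
    """Minimizer over u of (u - v)^2 / 2 + t * |u|, for t >= 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def prox_grad_step(x, grad, h, lam):
    """x+ = argmin_u { <grad, u> + ||u - x||_2^2 / (2h) + lam * ||u||_1 } with Q = R^n."""
    return [soft_threshold(xi - h * gi, h * lam) for xi, gi in zip(x, grad)]

def gradient_mapping(x, x_plus, h):
    """g(x) = (x - x+) / h."""
    return [(xi - pi) / h for xi, pi in zip(x, x_plus)]
```

With `lam = 0` the step reduces to $x^+ = x - h\nabla f(x)$ and the gradient mapping coincides with the gradient; with `lam > 0` small coordinates are set exactly to zero, which is why this step is popular for sparse problems.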

This approach was further generalized in [33, 83, 99] to the case of optimization
with an inexact oracle for the function $f$.


Definition 1 We say that a function $f(x)$ is equipped with an inexact first-order
oracle on a set $Q$ if there exists $\delta_u > 0$, and at any point $x \in Q$ for any number $\delta_c >
0$, there exists a constant $L(\delta_c) \in (0, +\infty)$, and one can calculate $\tilde{f}(x, \delta_c, \delta_u) \in \mathbb{R}$
and $\tilde{g}(x, \delta_c, \delta_u) \in \mathbb{R}^n$ satisfying

\[
|f(x) - \tilde{f}(x, \delta_c, \delta_u)| \le \delta_c + \delta_u,
\]

\[
f(y) - \tilde{f}(x, \delta_c, \delta_u) - \langle \tilde{g}(x, \delta_c, \delta_u), y - x \rangle \le \frac{L(\delta_c)}{2} \|x - y\|^2 + \delta_c + \delta_u, \quad \forall y \in Q.
\]

In this definition, $\delta_c$ represents the error of the oracle that we can control and
make as small as we like. In contrast, $\delta_u$ represents the error
that we cannot control. The method proposed for this setting in [83] is adaptive
to the constant $L$, works under inexact calculation of the point $x^+$, and covers
several different settings. In particular, smooth functions with Hölder-continuous gradient,
i.e., satisfying, for some $\nu \in [0, 1]$, $\|\nabla f(x) - \nabla f(y)\|_* \le L_\nu \|x - y\|^\nu$, $\forall x, y \in Q$,
satisfy this definition with $\delta_u = 0$ and


\[
L(\delta_c) = \left( \frac{1 - \nu}{1 + \nu} \cdot \frac{2}{\delta_c} \right)^{\frac{1 - \nu}{1 + \nu}} L_\nu^{\frac{2}{1 + \nu}}.
\]


As a corollary of the general method, [83] propose a universal method for such 
problems, which does not require the knowledge of the constants v, L, and gives 
the following convergence rate: 


l-v 

; 2 aw (l—v 40\> _! (F(x°) — F, € 
< 2 2v pein, LY , 
p_otin,_ [soe (= ae ) : ( 5 +5 


1 
Li (F(x°)—Fy) 


3p to find |go*)| < e. Inexact 
“Dy 


or the following complexity estimate 


E 
oracle models for convex optimization can be useful in non-convex optimization 
since in some settings a non-convex problem can be considered as a convex problem 
with inexact oracle [235, 236]. 


3.3. Incorporating Momentum for Acceleration 


The dynamical system $\dot{x}(t) = -\nabla f(x(t))$ considered above does not have any
mechanical intuition behind it. In [203], the author proposed to consider the
following dynamics:

\[
\mu \ddot{x}(t) = -\nabla f(x(t)) - p\dot{x}(t).
\]

One of the ways to discretize it gives the so-called heavy-ball method

\[
x^{k+1} = x^k - h \nabla f(x^k) + \beta (x^k - x^{k-1}),
\]

where $h > 0$ is the stepsize and $\beta \ge 0$ is the momentum parameter. Due to the
momentum term $\beta(x^k - x^{k-1})$, the method avoids zigzagging for ill-conditioned
problems, which leads to significant efficiency in practice, especially in training
neural networks. Despite this practical efficiency, the theoretical guarantee for this
method is no better than that for the gradient method. In particular, [120] considers
the dynamical system

\[
\mu(t) \ddot{x}(t) = -\nabla f(x(t)) - p(t) \dot{x}(t),
\]

where $\mu(t) \sim (f(x(t)) - c)$, $c$ is an upper bound on the global minimum of $f(x)$,
and $p(t) = F(\nabla f(x(t)))$. With a special choice of $F(\cdot)$, they show that $x(t)$
converges to a local minimizer $x^{loc}$ such that $f(x^{loc}) \le c$ as $t \to +\infty$. In [78], it is
shown that for a discretization of a further generalization of the heavy-ball method,
one may guarantee

\[
\min_{k=0,\ldots,N-1} \|\nabla f(x^k)\|_2^2 \le \frac{2L(f(x^0) - f_*)}{N},
\]

which coincides with the bound (2) for the gradient method.
A different type of momentum was proposed in [184] for convex optimization,
which led to Nesterov's accelerated gradient method

\[
x^{k+1} = x^k - h \nabla f\big(x^k + \beta_k (x^k - x^{k-1})\big) + \beta_k (x^k - x^{k-1}).
\]
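
The two momentum updates of this subsection differ only in the point where the gradient is evaluated, which is easiest to see side by side in code. The sketch below is a simplified illustration: it uses a constant momentum parameter `beta` in both methods (rather than a sequence $\beta_k$), starts with $x^{-1} = x^0$, and the parameter values in the usage note are hypothetical.

```python
def heavy_ball(grad, x0, h, beta, steps):
    """x^{k+1} = x^k - h * grad(x^k) + beta * (x^k - x^{k-1})."""
    x_prev, x = list(x0), list(x0)
    for _ in range(steps):
        g = grad(x)  # gradient at the current point
        x_next = [xi - h * gi + beta * (xi - pi) for xi, gi, pi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x

def nesterov(grad, x0, h, beta, steps):
    """x^{k+1} = y^k - h * grad(y^k) with y^k = x^k + beta * (x^k - x^{k-1})."""
    x_prev, x = list(x0), list(x0)
    for _ in range(steps):
        y = [xi + beta * (xi - pi) for xi, pi in zip(x, x_prev)]
        g = grad(y)  # gradient at the extrapolated point
        x_next = [yi - h * gi for yi, gi in zip(y, g)]
        x_prev, x = x, x_next
    return x
```

For example, on the ill-conditioned quadratic with gradient `lambda x: [x[0], 100.0 * x[1]]`, both methods with `h = 0.01` and `beta = 0.8` drive the iterates to zero, and with `beta = 0` both reduce to plain gradient descent.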


The difference from the heavy-ball method is that the gradient is calculated at the
extrapolated point. This idea has been very fruitful and has led to many
accelerated algorithms for convex optimization. A variant of this method with a
special choice of the stepsize $h$ and momentum term $\beta_k$ was shown in [103] to have
the same convergence rate (2) as the gradient method. This was further extended in
[105] to the case of objectives with Hölder-continuous gradients to obtain a bound


1 
v 0y_ i : . F Pe iaelegts 
Eu EG aE) to find | £0 (x*) | < e in the general setting of composite optimization 
2v 


problem (3) with simple constraints. Importantly, this method is universal and
uniform, which means that it has the best possible convergence rates for convex and
non-convex problems without knowing whether the problem is convex or not and
without knowing its smoothness parameters such as the Hölder exponent and Hölder
constant.

It is possible to combine this idea with the idea of line search, i.e., minimization
in the direction of the step. The papers [122, 190] propose a modification of the
accelerated gradient method that is listed as Algorithm 1. Instead of explicitly
defining the stepsize $h$ and the momentum term $\beta_k$, this method uses full
one-dimensional relaxation and local information. This makes the method parameter-free
and uniform for convex and non-convex smooth optimization, providing
optimal complexity bounds in both the convex and non-convex cases. At the same
time, inexact line search is possible, and the line-search accuracy sufficient for achieving the
desired accuracy is estimated. This method shares some similarities with nonlinear
conjugate gradient methods that were analyzed in [183].


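Algorithm 1 Accelerated Gradient Method with Small-Dimensional Relaxation
(AGMsDR)

Ensure: $x^k$
1: Set $k = 0$, $A_0 = 0$, $x^0 = v^0$, $\psi_0(x) = V[x^0](x)$
2: for $k \ge 0$ do
3:   $\beta_k = \arg\min_{\beta \in [0,1]} f\big(v^k + \beta(x^k - v^k)\big)$, $y^k = v^k + \beta_k(x^k - v^k)$
4:   Let $(\nabla f(y^k))^{\#}$ be such that $\langle \nabla f(y^k), (\nabla f(y^k))^{\#} \rangle = \|\nabla f(y^k)\|_2$ and $\|(\nabla f(y^k))^{\#}\|_2 = 1$.
     $h_{k+1} = \arg\min_{h \ge 0} f\big(y^k - h(\nabla f(y^k))^{\#}\big)$, $x^{k+1} = y^k - h_{k+1}(\nabla f(y^k))^{\#}$
5:   Find $a_{k+1}$ from the equation $f(y^k) - \frac{a_{k+1}^2}{2A_{k+1}} \|\nabla f(y^k)\|_2^2 = f(x^{k+1})$
6:   Set $A_{k+1} = A_k + a_{k+1}$
7:   Set $\psi_{k+1}(x) = \psi_k(x) + a_{k+1}\{f(y^k) + \langle \nabla f(y^k), x - y^k \rangle\}$
8:   $v^{k+1} = \arg\min_{x \in \mathbb{R}^n} \psi_{k+1}(x)$, $k = k + 1$
9: end for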
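
The only step of Algorithm 1 that is not a line search or an explicit update is finding $a_{k+1}$. Assuming the equation has the form $f(y^k) - \frac{a_{k+1}^2}{2(A_k + a_{k+1})}\|\nabla f(y^k)\|_2^2 = f(x^{k+1})$, it is a quadratic equation in $a_{k+1}$: writing $d = f(y^k) - f(x^{k+1})$ and $G = \|\nabla f(y^k)\|_2^2$, it becomes $G a^2 - 2da - 2dA_k = 0$. The helper below, whose name and signature are hypothetical, returns the positive root.

```python
import math

def solve_a_next(A_k, decrease, grad_norm_sq):
    """Positive root a of a^2 * G / (2 * (A_k + a)) = d.

    d = f(y^k) - f(x^{k+1}) > 0 is the decrease achieved by the line search,
    G = ||grad f(y^k)||^2 > 0, and A_k >= 0 is the current accumulated weight."""
    d, G = decrease, grad_norm_sq
    return (d + math.sqrt(d * d + 2.0 * d * G * A_k)) / G
```

For $A_k = 0$ the root reduces to $a = 2d/G$, so a larger decrease obtained by the line search translates directly into a larger weight $a_{k+1}$.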


The above idea was further extended in [123], where an accelerated alternating
minimization method was proposed and analyzed for convex and non-convex
problems. The main assumption is that the set of coordinates is divided into $n$
disjoint subsets (blocks) $I_p$, $p \in \{1, \ldots, n\}$, and that minimization over each block, with
the other variables frozen, can be carried out explicitly. The resulting accelerated
alternating minimization algorithm is listed as Algorithm 2. This method is also
parameter-free and uniform for convex and non-convex smooth optimization, with
optimal complexity bounds in the convex and non-convex cases.


Algorithm 2 Accelerated Alternating Minimization (AAM)


Require: Starting point $x^0$.
Ensure: $x^k$
1: Set $A_0 = 0$, $x^0 = v^0$.
2: for $k \ge 0$ do
3:   Set $\beta_k = \arg\min_{\beta \in [0,1]} f\big(x^k + \beta(v^k - x^k)\big)$
4:   Set $y^k = x^k + \beta_k(v^k - x^k)$
5:   Choose $i_k = \arg\max_{i \in \{1, \ldots, n\}} \|\nabla_i f(y^k)\|_2^2$
6:   Set $x^{k+1} = \arg\min_{x \in S_{i_k}(y^k)} f(x)$, i.e., minimize $f$ in the corresponding block.
7:   Find $a_{k+1}$, $A_{k+1} = A_k + a_{k+1}$ from $f(y^k) - \frac{a_{k+1}^2}{2A_{k+1}} \|\nabla f(y^k)\|_2^2 = f(x^{k+1})$
8:   Set $v^{k+1} = v^k - a_{k+1} \nabla f(y^k)$
9: end for

The sequence $y^k$ of this algorithm satisfies

$$\min_{i=0,\ldots,k} \|\nabla f(y^i)\|_2^2 \le \frac{2nL(f(x^0) - f_*)}{k},$$

i.e., there is an additional multiplier n, the number of blocks. If the function turns
out to be convex, then the same method generates the sequence $x^k$ that gives the
decay of the objective similar to the accelerated gradient method:

$$f(x^k) - f(x^*) \le \frac{2nL\|x^0 - x^*\|_2^2}{k^2},$$

where $x^*$ is the global minimizer closest to $x^0$.

By exploiting the idea of Nesterov's acceleration and combining it with the
notion of negative curvature, the authors of [48] manage to accelerate first-order
methods for non-convex optimization under the additional assumptions that the second and
third derivatives are Lipschitz-continuous. More precisely, if an L-smooth function
also has a Lipschitz-continuous Hessian, they obtain complexity $O(\varepsilon^{-7/4}\log(1/\varepsilon))$
to find a point x such that $\|\nabla f(x)\|_2 \le \varepsilon$. Assuming additionally that the third
derivative is Lipschitz, this bound is improved to $O(\varepsilon^{-5/3}\log(1/\varepsilon))$.

4 Stochastic First-Order Methods


In this section, we consider the same problem as in Sect. 3:

$$\min_{x \in \mathbb{R}^n} f(x), \tag{4}$$

where the function f is a general non-convex L-smooth function with the uniform lower
bound $f_*$, i.e., it is differentiable and

$$f(x) \ge f_* \quad \forall x \in \mathbb{R}^n, \qquad \|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2 \quad \forall x, y \in \mathbb{R}^n.$$

We are interested in two particular cases: expectation minimization

$$f(x) = \mathbb{E}_\xi[f(x, \xi)], \tag{5}$$

and finite-sum minimization

$$f(x) = \frac{1}{m}\sum_{i=1}^m f_i(x). \tag{6}$$


Such problems usually arise in applications of (deep) machine learning [107, 237] 
and mathematical statistics [234], and typically, they are solved via stochastic first- 
order methods. 

In general, the best one can expect to achieve is an approximate stationary point
[15, 251]. To be specific, for this class of problems, stochastic first-order methods
in the worst case can only find a point $\hat{x}$ such that

$$\mathbb{E}\left[\|\nabla f(\hat{x})\|_2^2\right] \le \varepsilon^2. \tag{7}$$

For simplicity, we call such a point $\hat{x}$ an ε-stationary point, meaning that
inequality (7) holds.

Below we summarize recent results about finding an ε-stationary point using
stochastic first-order methods. We start with presenting the general and unified
approach to analyze optimal deterministic and stochastic first-order methods for
objectives of types (5) and (6) in the general settings. After that, we consider 3 big
classes of stochastic first-order methods with convergence guarantees: SGD and its
variants, variance-reduced methods, and adaptive stochastic methods.

4.1 General View on Optimal Deterministic and Stochastic
First-Order Methods for Non-Convex Optimization


Assume that at each point x, we have access to an estimator g(x) of the gradient
$\nabla f(x)$. For now, it is not important to specify what properties g(x) satisfies. In this
setting, one can use Algorithm 3 to find an ε-stationary point.


Algorithm 3 General scheme of the optimal first-order method for non-convex
optimization

Require: learning rates $\{h_k\}_{k \ge 0}$ satisfying $h_k \le \frac{1}{2L}$, starting point $x^0 \in \mathbb{R}^n$, stopping criterion C
1: for k = 0, 1, 2, ... do
2:   Get $g^k = g(x^k)$
3:   if C holds then
4:     $x^N = x^k$
5:     break
6:   else
7:     $x^{k+1} = x^k - h_k g^k$
8:   end if
9: end for
10: return $x^N$
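The loop of Algorithm 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy objective $f(x) = \sum_i (x_i^2 + 3\sin^2 x_i)$ (non-convex, with $L = 8$), the exact gradient used as the estimator g, and the simple criterion $\|g^k\|_2 \le \varepsilon$ are choices made here, not taken from the text, and the name `first_order_scheme` is hypothetical.

```python
import math

def first_order_scheme(grad_est, x0, step, should_stop, max_iter=10_000):
    """Loop of Algorithm 3: x^{k+1} = x^k - h * g^k until the criterion holds."""
    x = list(x0)
    for k in range(max_iter):
        g = grad_est(x)                      # g^k = g(x^k)
        if should_stop(g):                   # stopping criterion C
            return x, k
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, max_iter

# Assumed toy objective: f(x) = sum_i (x_i^2 + 3 sin^2(x_i)).
# Its second derivative 2 + 6 cos(2 x_i) lies in [-4, 8], so L = 8.
def grad_f(x):
    return [2.0 * xi + 3.0 * math.sin(2.0 * xi) for xi in x]

L = 8.0
eps = 1e-3
x_out, iters = first_order_scheme(
    grad_est=grad_f,                         # exact gradient as the estimator
    x0=[2.0, -1.5],
    step=1.0 / (2.0 * L),                    # h_k = 1 / (2L)
    should_stop=lambda g: sum(gi * gi for gi in g) <= eps ** 2,
)
grad_norm = math.sqrt(sum(gi * gi for gi in grad_f(x_out)))
```

Passing the estimator and the criterion as callables mirrors the way the scheme is specialized in the three cases discussed below.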


Below we derive preliminary inequalities playing the central role in the analysis
of optimal (stochastic) first-order algorithms. From L-smoothness of f, we have

$$\begin{aligned}
f(x^{k+1}) &\le f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{L}{2}\|x^{k+1} - x^k\|_2^2 \\
&= f(x^k) + \langle g^k, x^{k+1} - x^k \rangle + \langle \nabla f(x^k) - g^k, x^{k+1} - x^k \rangle + \frac{L}{2}\|x^{k+1} - x^k\|_2^2 \\
&\le f(x^k) - h_k\|g^k\|_2^2 + h_k\|\nabla f(x^k) - g^k\|_2^2 + \left(\frac{1}{4h_k} + \frac{L}{2}\right)\|x^{k+1} - x^k\|_2^2,
\end{aligned}$$

where in the last inequality we use the Fenchel-Young inequality
$\langle a, b \rangle \le \frac{1}{2\alpha}\|a\|_2^2 + \frac{\alpha}{2}\|b\|_2^2$
with $a = \nabla f(x^k) - g^k$, $b = x^{k+1} - x^k$, and $\alpha = \frac{1}{2h_k}$. Since $h_k \le \frac{1}{2L}$ and
$x^{k+1} = x^k - h_k g^k$, we can continue our derivations:

$$f(x^{k+1}) \le f(x^k) - \frac{h_k}{2}\|g^k\|_2^2 + h_k\|\nabla f(x^k) - g^k\|_2^2.$$

Now it is crucial to specify what we need to assume about g(x). We emphasize that
all 3 cases considered below are based on tight bounds for $\|\nabla f(x^k) - g^k\|_2^2$ or
its expectation.

4.1.1 Deterministic Case


In this case, we assume that for all $x \in \mathbb{R}^n$, we have access to a g(x) such that

$$\|g(x) - \nabla f(x)\|_2^2 \le \frac{\varepsilon^2}{10}. \tag{8}$$

In other words, g(x) is a good enough approximation of $\nabla f(x)$. Consider the


stopping criterion C = {iis* IS < 2 | and let hy = sr for all k > 0. First of 


all, if Algorithm 3 stops, then || g% ||2 < a and x" satisfies 
(8) 
IV FMI = IVE) — 9% + N12 < ZVI") — 83 + Il" I Se”. 
Next, we derive an upper bound on the number N of iterations after which Algorithm 3
stops. Assume that, after N iterations, the method has not stopped. Then for
all $k = 0, 1, \ldots, N$, we have


(10),(8) Ane? hye Ee 
k+1 iv ky “tk as - he. 
far) s FO") + f(x") - Z 
Unrolling the recurrence, we obtain 
fo") < fo- way 


20L 


| 


Z 20L(f(x°) — f@%*)) 1 20L(f(x°) — fa) 


362 _ 362 


Therefore, the method stops after at most

$$N \le \frac{20L(f(x^0) - f_*)}{3\varepsilon^2}$$

iterations. This bound is optimal up to constant factors [51].


4.1.2 Stochastic Case: Uniformly Bounded Variance 


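In this case, we assume that for all $x \in \mathbb{R}^n$ we have

$$\mathbb{E}[g(x) \mid x] = \nabla f(x), \qquad \mathbb{E}\left[\|g(x) - \nabla f(x)\|_2^2 \mid x\right] \le \frac{\varepsilon^2}{2}. \tag{9}$$

For example, this situation appears when

$$f(x) = \mathbb{E}_\xi[f(x, \xi)],$$

where ξ is a random variable with distribution $\mathcal{D}$ and g(x) is formed as

$$g(x) = \frac{1}{r}\sum_{i=1}^r \nabla f(x, \xi_i), \tag{10}$$

where $\xi_1, \ldots, \xi_r$ are i.i.d. samples from $\mathcal{D}$ and

$$\mathbb{E}_\xi[\nabla f(x, \xi)] = \nabla f(x), \qquad \mathbb{E}_\xi\left[\|\nabla f(x, \xi) - \nabla f(x)\|_2^2\right] \le \sigma^2. \tag{11}$$

Indeed, if we choose $r = \max\{1, 2\sigma^2/\varepsilon^2\}$, then due to the independence of $\xi_1, \ldots, \xi_r$, we
have

$$\mathbb{E}\left[\|g(x) - \nabla f(x)\|_2^2 \mid x\right] = \frac{1}{r^2}\sum_{i=1}^r \mathbb{E}_{\xi_i}\left[\|\nabla f(x, \xi_i) - \nabla f(x)\|_2^2\right] \le \frac{\sigma^2}{r} \le \frac{\varepsilon^2}{2}.$$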

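Then, taking the conditional expectation $\mathbb{E}[\cdot \mid x^k]$ of both sides of (10), we derive

$$\begin{aligned}
\mathbb{E}\left[f(x^{k+1}) \mid x^k\right] &\le f(x^k) - \frac{h_k}{2}\mathbb{E}\left[\|g^k\|_2^2 \mid x^k\right] + h_k\mathbb{E}\left[\|g^k - \nabla f(x^k)\|_2^2 \mid x^k\right] \\
&= f(x^k) - \frac{h_k}{2}\|\nabla f(x^k)\|_2^2 - \frac{h_k}{2}\mathbb{E}\left[\|g^k - \nabla f(x^k)\|_2^2 \mid x^k\right] + h_k\mathbb{E}\left[\|g^k - \nabla f(x^k)\|_2^2 \mid x^k\right] \\
&= f(x^k) - \frac{h_k}{2}\|\nabla f(x^k)\|_2^2 + \frac{h_k}{2}\mathbb{E}\left[\|g^k - \nabla f(x^k)\|_2^2 \mid x^k\right] \\
&\overset{(9)}{\le} f(x^k) - \frac{h_k}{2}\|\nabla f(x^k)\|_2^2 + \frac{h_k\varepsilon^2}{4}.
\end{aligned}$$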

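After that, we take the full expectation of both sides of the previous inequality,
choose $h_k = \frac{1}{2L}$, and sum up the result for $k = 0, 1, \ldots, N-1$:

$$\begin{aligned}
\frac{1}{N}\sum_{k=0}^{N-1}\mathbb{E}\left[\|\nabla f(x^k)\|_2^2\right] &\le \frac{4L}{N}\sum_{k=0}^{N-1}\left(\mathbb{E}[f(x^k)] - \mathbb{E}[f(x^{k+1})]\right) + \frac{\varepsilon^2}{2} \\
&= \frac{4L\left(f(x^0) - \mathbb{E}[f(x^N)]\right)}{N} + \frac{\varepsilon^2}{2} \\
&\le \frac{4L(f(x^0) - f_*)}{N} + \frac{\varepsilon^2}{2}.
\end{aligned}$$

Finally, we choose the output of the method $\hat{x}^N$ uniformly at random from
$x^0, x^1, \ldots, x^{N-1}$, which implies

$$\mathbb{E}\left[\|\nabla f(\hat{x}^N)\|_2^2\right] \le \frac{4L(f(x^0) - f_*)}{N} + \frac{\varepsilon^2}{2}.$$

Taking $N = \frac{8L(f(x^0) - f_*)}{\varepsilon^2}$, we obtain $\mathbb{E}\left[\|\nabla f(\hat{x}^N)\|_2^2\right] \le \varepsilon^2$. Moreover, the total
number of stochastic oracle calls (the number of $\nabla f(x, \xi)$ computations) is

$$\sum_{k=0}^{N-1} r = Nr = \max\left\{\frac{8L(f(x^0) - f_*)}{\varepsilon^2}, \frac{16L(f(x^0) - f_*)\sigma^2}{\varepsilon^4}\right\}.$$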

This bound is optimal up to constant factors for the case when the variance is 
uniformly upper bounded [15]. 
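As a worked instance of the counting argument above, the following sketch evaluates the batch size $r = \max\{1, 2\sigma^2/\varepsilon^2\}$, the iteration count $N = 8L(f(x^0) - f_*)/\varepsilon^2$, and their product. Rounding both quantities up to integers is an assumption made here for the sake of the example; the function name and the numerical inputs are hypothetical.

```python
import math

def minibatch_sgd_budget(L, delta0, sigma2, eps):
    """Batch size, iteration count, and oracle calls for the setting of Sect. 4.1.2.

    delta0 stands for f(x^0) - f_*; rounding up is an assumption of this sketch.
    """
    r = max(1, math.ceil(2.0 * sigma2 / eps ** 2))      # r = max{1, 2 sigma^2 / eps^2}
    n_iters = math.ceil(8.0 * L * delta0 / eps ** 2)    # N = 8 L delta0 / eps^2
    return r, n_iters, r * n_iters

r, n_iters, calls = minibatch_sgd_budget(L=1.0, delta0=1.0, sigma2=1.0, eps=0.5)
```

With these inputs the product equals $16L(f(x^0) - f_*)\sigma^2/\varepsilon^4 = 256$, the second term of the maximum in the bound.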


4.1.3 Stochastic Case: Finite-Sum Minimization 


In this case, we assume that the objective function has a finite-sum structure (6) with 
L-smooth summands. In fact, this smoothness constant L can be significantly larger 
than the smoothness constant of f. It is essential for providing a fair comparison 
of different complexity results. It is possible to improve the dependence on L 
in the final complexity bounds [164] using average smoothness assumption, but 
for simplicity we consider the case when all summands are L-smooth. Moreover, 
we assume that there exists a constant $\sigma^2$ (possibly infinite) such that for ξ taken
uniformly at random from $\{1, \ldots, m\}$ and for all $x \in \mathbb{R}^n$

$$\mathbb{E}_\xi\left[\|\nabla f_\xi(x) - \nabla f(x)\|_2^2\right] \le \sigma^2. \tag{12}$$


We define $r_k$, q, $g^k$, and the stepsize $h_k$ in the following way:

$$r_k = r = \max\left\{1, \frac{20\sigma^2}{\varepsilon^2}\right\}, \qquad q = \min\{r, m\},$$

$$g^k = \begin{cases}
\frac{1}{r}\sum_{j=1}^{r} \nabla f_{\xi_{k,j}}(x^k), & \text{if } r < m \text{ and } r \text{ divides } k, \\
\nabla f(x^k), & \text{if } m \le r \text{ and } m \text{ divides } k, \\
\nabla f_{\xi_k}(x^k) - \nabla f_{\xi_k}(x^{k-1}) + g^{k-1}, & \text{otherwise},
\end{cases}$$

$$h_k = h = \frac{1}{10L\sqrt{q}}.$$

Here, at iteration k the random index $\xi_k$ is sampled uniformly at random from
$\{1, \ldots, m\}$ if k is not divisible by q, and the random indices $\xi_{k,1}, \ldots, \xi_{k,r}$ are i.i.d.
samples from the uniform distribution on $\{1, \ldots, m\}$ if q = r and r divides k. As
a result, we obtain a variant of SPIDER [91]. We notice that for k = aq + p,
$p \in \{0, 1, \ldots, q-1\}$, iteration k requires 2 calculations of $\nabla f_\xi(x)$ when $p \ne 0$ and
q calculations of $\nabla f_\xi(x)$ when p = 0. This implies that q iterations of the method
require at most 3q calculations of $\nabla f_\xi(x)$, so, if $k \ge q$, then the number of stochastic
first-order oracle calls coincides with the number of iterations up to a constant factor 3.

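Below we present a simplified approach to analyze SPIDER. As before, our goal
is to show that $\mathbb{E}\left[\|g^k - \nabla f(x^k)\|_2^2\right]$ can be upper-bounded by either something
small or something that can be controlled by other terms in (10). First of all, if
k = aq, then

$$\begin{aligned}
\mathbb{E}\left[\|g^k - \nabla f(x^k)\|_2^2\right]
&= \begin{cases} 0, & \text{if } q = m, \\ \mathbb{E}\left[\left\|\frac{1}{r}\sum_{j=1}^r \nabla f_{\xi_{k,j}}(x^k) - \nabla f(x^k)\right\|_2^2\right], & \text{if } q = r \end{cases} \\
&= \begin{cases} 0, & \text{if } q = m, \\ \frac{1}{r^2}\sum_{j=1}^r \mathbb{E}\left[\|\nabla f_{\xi_{k,j}}(x^k) - \nabla f(x^k)\|_2^2\right], & \text{if } q = r \end{cases} \\
&\le \begin{cases} 0, & \text{if } q = m, \\ \frac{\sigma^2}{r}, & \text{if } q = r \end{cases}
\;\le\; \begin{cases} 0, & \text{if } q = m, \\ \frac{\varepsilon^2}{20}, & \text{if } q = r, \end{cases}
\end{aligned}$$

where the second equality holds due to the independence of $\xi_{k,1}, \ldots, \xi_{k,r}$, and in the
third relation, we applied the tower property $\mathbb{E}[\cdot] = \mathbb{E}\left[\mathbb{E}[\cdot \mid x^k]\right]$ together with (12).
Second, if k = aq + p with $p \in \{1, \ldots, q-1\}$, we have

$$\begin{aligned}
\mathbb{E}\left[\|g^k - \nabla f(x^k)\|_2^2\right]
&= \mathbb{E}\left[\left\|\nabla f_{\xi_k}(x^k) - \nabla f_{\xi_k}(x^{k-1}) + g^{k-1} - \nabla f(x^k)\right\|_2^2\right] \\
&= \mathbb{E}\left[\left\|\nabla f_{\xi_k}(x^k) - \nabla f_{\xi_k}(x^{k-1}) - \nabla f(x^k) + \nabla f(x^{k-1})\right\|_2^2\right] + \mathbb{E}\left[\|g^{k-1} - \nabla f(x^{k-1})\|_2^2\right],
\end{aligned}$$

where we use the variance decomposition³
$\mathbb{E}_{\xi_k}\left[\|\eta\|_2^2\right] = \mathbb{E}_{\xi_k}\left[\|\eta - \mathbb{E}_{\xi_k}[\eta]\|_2^2\right] + \left\|\mathbb{E}_{\xi_k}[\eta]\right\|_2^2$
for the random vector $\eta = \nabla f_{\xi_k}(x^k) - \nabla f_{\xi_k}(x^{k-1}) + g^{k-1} - \nabla f(x^k)$ together
with the tower property $\mathbb{E}[\cdot] = \mathbb{E}\left[\mathbb{E}_{\xi_k}[\cdot]\right]$. Using the inequality above together with
$\|a + b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2$, $a, b \in \mathbb{R}^n$, and L-smoothness of $f_1, \ldots, f_m, f$, we get

$$\begin{aligned}
\mathbb{E}\left[\|g^k - \nabla f(x^k)\|_2^2\right]
&\le 2\mathbb{E}\left[\|\nabla f_{\xi_k}(x^k) - \nabla f_{\xi_k}(x^{k-1})\|_2^2\right] + 2\mathbb{E}\left[\|\nabla f(x^k) - \nabla f(x^{k-1})\|_2^2\right] + \mathbb{E}\left[\|g^{k-1} - \nabla f(x^{k-1})\|_2^2\right] \\
&\le 4L^2\mathbb{E}\left[\|x^k - x^{k-1}\|_2^2\right] + \mathbb{E}\left[\|g^{k-1} - \nabla f(x^{k-1})\|_2^2\right] \\
&= 4L^2h^2\mathbb{E}\left[\|g^{k-1}\|_2^2\right] + \mathbb{E}\left[\|g^{k-1} - \nabla f(x^{k-1})\|_2^2\right] \\
&\le 8L^2h^2\mathbb{E}\left[\|\nabla f(x^{k-1})\|_2^2\right] + (1 + 8L^2h^2)\mathbb{E}\left[\|g^{k-1} - \nabla f(x^{k-1})\|_2^2\right].
\end{aligned}$$

Unrolling the recurrence, we derive

$$\begin{aligned}
\mathbb{E}\left[\|g^k - \nabla f(x^k)\|_2^2\right]
&\le 8L^2h^2\sum_{l=1}^{p}(1 + 8L^2h^2)^{l-1}\mathbb{E}\left[\|\nabla f(x^{k-l})\|_2^2\right] + (1 + 8L^2h^2)^p\,\mathbb{E}\left[\|g^{aq} - \nabla f(x^{aq})\|_2^2\right] \\
&\le (1 + 8L^2h^2)^q\sum_{l=1}^{p} 8L^2h^2\,\mathbb{E}\left[\|\nabla f(x^{k-l})\|_2^2\right] + (1 + 8L^2h^2)^q \begin{cases} 0, & \text{if } q = m, \\ \frac{\varepsilon^2}{20}, & \text{if } q = r \end{cases} \\
&\le \exp(8L^2h^2q)\sum_{l=1}^{p} 8L^2h^2\,\mathbb{E}\left[\|\nabla f(x^{k-l})\|_2^2\right] + \exp(8L^2h^2q) \begin{cases} 0, & \text{if } q = m, \\ \frac{\varepsilon^2}{20}, & \text{if } q = r, \end{cases}
\end{aligned}$$

where the second inequality uses p < q and the bound for k = aq derived above, and the
third one uses $(1 + x)^q \le e^{qx}$.

Next, using the choice of the stepsize $h = 1/(10L\sqrt{q})$, for which $8L^2h^2q = 2/25$, we obtain

$$\mathbb{E}\left[\|g^k - \nabla f(x^k)\|_2^2\right] \le \frac{9}{100q}\sum_{l=1}^{p}\mathbb{E}\left[\|\nabla f(x^{k-l})\|_2^2\right] + \frac{11\varepsilon^2}{200}.$$

³ Here $\mathbb{E}_{\xi_k}[\cdot]$ is the mathematical expectation conditioned on everything except $\xi_k$, i.e., the expectation
is taken w.r.t. the randomness coming only from $\xi_k$.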
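Finally, we put all the inequalities together. We start with modifying (10):

$$\begin{aligned}
f(x^{k+1}) &\le f(x^k) - \frac{h}{2}\|g^k\|_2^2 + h\|\nabla f(x^k) - g^k\|_2^2 \\
&\le f(x^k) - \frac{h}{4}\|\nabla f(x^k)\|_2^2 + \frac{3h}{2}\|\nabla f(x^k) - g^k\|_2^2,
\end{aligned}$$

where we used that the inequality $\|a + b\|_2^2 \ge \frac{1}{2}\|a\|_2^2 - \|b\|_2^2$ holds for all $a, b \in \mathbb{R}^n$
(in particular, we use $a = \nabla f(x^k)$ and $b = g^k - \nabla f(x^k)$). Next, we take the full
mathematical expectation of both sides of the previous inequality (taking into
account that k = aq + p):

$$\begin{aligned}
\mathbb{E}\left[f(x^{aq+p+1})\right] &\le \mathbb{E}\left[f(x^{aq+p})\right] - \frac{h}{4}\mathbb{E}\left[\|\nabla f(x^{aq+p})\|_2^2\right] + \frac{3h}{2}\mathbb{E}\left[\|g^{aq+p} - \nabla f(x^{aq+p})\|_2^2\right] \\
&\le \mathbb{E}\left[f(x^{aq+p})\right] - \frac{h}{4}\mathbb{E}\left[\|\nabla f(x^{aq+p})\|_2^2\right] + \frac{27h}{200q}\sum_{l=1}^{p}\mathbb{E}\left[\|\nabla f(x^{aq+p-l})\|_2^2\right] + \frac{33h\varepsilon^2}{400}.
\end{aligned}$$

We notice that this inequality holds for all integers $a \ge 0$ and $p \in \{0, \ldots, q-1\}$.
Summing up these inequalities for $p = 0, \ldots, P$ and taking a = A, where N =
Aq + P, $P \in \{0, \ldots, q-1\}$, we get

$$\begin{aligned}
0 &\le \sum_{p=0}^{P}\left(\mathbb{E}\left[f(x^{Aq+p})\right] - \mathbb{E}\left[f(x^{Aq+p+1})\right]\right) - \frac{h}{4}\sum_{p=0}^{P}\mathbb{E}\left[\|\nabla f(x^{Aq+p})\|_2^2\right] \\
&\quad + \frac{27h}{200q}\sum_{p=0}^{P}\sum_{l=1}^{p}\mathbb{E}\left[\|\nabla f(x^{Aq+p-l})\|_2^2\right] + \frac{33h\varepsilon^2(P+1)}{400} \\
&\le \mathbb{E}\left[f(x^{Aq})\right] - \mathbb{E}\left[f(x^{Aq+P+1})\right] - h\left(\frac{1}{4} - \frac{27}{200}\right)\sum_{p=0}^{P}\mathbb{E}\left[\|\nabla f(x^{Aq+p})\|_2^2\right] + \frac{33h\varepsilon^2(P+1)}{400} \\
&= \mathbb{E}\left[f(x^{Aq})\right] - \mathbb{E}\left[f(x^{Aq+P+1})\right] - \frac{23h}{200}\sum_{p=0}^{P}\mathbb{E}\left[\|\nabla f(x^{Aq+p})\|_2^2\right] + \frac{33h\varepsilon^2(P+1)}{400},
\end{aligned}$$

where the second inequality holds because each term of the double sum appears at most q times. Hence,

$$\frac{23h}{200}\sum_{p=0}^{P}\mathbb{E}\left[\|\nabla f(x^{Aq+p})\|_2^2\right] \le \mathbb{E}\left[f(x^{Aq})\right] - \mathbb{E}\left[f(x^{Aq+P+1})\right] + \frac{33h\varepsilon^2(P+1)}{400}.$$

These inequalities hold for all A and P. Summing them up for
$(A, P) = (0, q-1), (1, q-1), \ldots, (A-1, q-1), (A, P)$, where N = Aq + P, and dividing
the result by $\frac{23h(N+1)}{200}$, we get

$$\frac{1}{N+1}\sum_{k=0}^{N}\mathbb{E}\left[\|\nabla f(x^k)\|_2^2\right] \le \frac{200(f(x^0) - f_*)}{23h(N+1)} + \frac{33\varepsilon^2}{46} = \frac{2000L\sqrt{q}\,(f(x^0) - f_*)}{23(N+1)} + \frac{33\varepsilon^2}{46},$$

where the last equality follows from the choice of h. Finally, taking $\hat{x}^N$ uniformly at random from $x^0, \ldots, x^N$, we get

$$\mathbb{E}\left[\|\nabla f(\hat{x}^N)\|_2^2\right] \le \frac{2000L\sqrt{q}\,(f(x^0) - f_*)}{23(N+1)} + \frac{33\varepsilon^2}{46}.$$

This implies that after

$$N = \frac{4000L\sqrt{q}\,(f(x^0) - f_*)}{13\varepsilon^2} = \frac{4000L(f(x^0) - f_*)}{13\varepsilon^2}\min\left\{\sqrt{m}, \max\left\{1, \frac{\sqrt{20}\,\sigma}{\varepsilon}\right\}\right\}$$

iterations, we reach $\mathbb{E}\left[\|\nabla f(\hat{x}^N)\|_2^2\right] \le \varepsilon^2$. Moreover, it requires

$$O\left(\frac{L(f(x^0) - f_*)}{\varepsilon^2}\min\left\{\sqrt{m}, \max\left\{1, \frac{\sigma}{\varepsilon}\right\}\right\} + \min\left\{m, \max\left\{1, \frac{\sigma^2}{\varepsilon^2}\right\}\right\}\right)$$

calculations of $\nabla f_\xi(x)$, which is optimal up to constant factors [91].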
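The recursive estimator analyzed above can be sketched as follows for the regime q = m, where every q-th iteration recomputes the full gradient and the other iterations use the correction $\nabla f_{\xi_k}(x^k) - \nabla f_{\xi_k}(x^{k-1}) + g^{k-1}$. The scalar quadratic summands, the starting point, and the iteration budget are assumptions chosen only to make the example self-contained; the function name `spider` is hypothetical.

```python
import math
import random

def spider(grads, x0, L, n_iters, seed=0):
    """SPIDER-type recursion in the regime q = m (full-gradient refresh)."""
    rng = random.Random(seed)
    m = len(grads)
    q = m
    h = 1.0 / (10.0 * L * math.sqrt(q))          # stepsize h = 1 / (10 L sqrt(q))
    full_grad = lambda z: sum(gi(z) for gi in grads) / m
    x_prev, x, g = x0, x0, 0.0
    for k in range(n_iters):
        if k % q == 0:
            g = full_grad(x)                     # g^k = grad f(x^k) when q divides k
        else:
            i = rng.randrange(m)                 # index xi_k, uniform on {0, ..., m-1}
            g = grads[i](x) - grads[i](x_prev) + g
        x_prev, x = x, x - h * g
    return x, full_grad(x)

# Assumed toy problem: f_i(x) = 0.5 * a_i * (x - b_i)^2, each 1.5-smooth.
m = 16
a = [1.0 + 0.5 * math.sin(i) for i in range(m)]
b = [math.cos(i) for i in range(m)]
grads = [lambda z, ai=ai, bi=bi: ai * (z - bi) for ai, bi in zip(a, b)]
x_out, g_out = spider(grads, x0=5.0, L=1.5, n_iters=2000)
```

Only one previous iterate and one previous estimate are stored, which is why the method needs two stochastic gradient evaluations per non-refresh iteration.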


4.2 SGD and Its Variants


As was shown in the previous section, SGD

$$x^{k+1} = x^k - h_k g(x^k), \qquad \mathbb{E}[g(x)] = \nabla f(x),$$

in the setting of Sect. 4.1.2 requires $O\left(\frac{L(f(x^0) - f_*)}{\varepsilon^2}\right)$ iterations with batch size
$r = O\left(\max\left\{1, \frac{\sigma^2}{\varepsilon^2}\right\}\right)$ to find an ε-stationary point in expectation. The total number of
stochastic first-order oracle calls equals

$$O\left(\frac{L(f(x^0) - f_*)}{\varepsilon^2}\max\left\{1, \frac{\sigma^2}{\varepsilon^2}\right\}\right). \tag{13}$$

We emphasize that we use a large batch size for the sake of simplicity and unification
of the results in 3 different cases. In fact, it is possible to obtain the bound (13) using
smaller stepsizes and constant batch sizes of the order O(1) [102].


4.2.1 Assumptions on the Stochastic Gradient 
In addition to assumption (11), which is quite restrictive, there exist several other 


assumptions on the stochastic gradient studied in the literature. Recently, a simple
and unified way to cover the most popular ones was proposed in [142].




Assumption 4.2 (Expected Smoothness; Assumption 2 from [142]) The second
moment of stochastic gradients satisfies

$$\mathbb{E}\left[\|g(x)\|_2^2\right] \le 2A(f(x) - f_*) + B\|\nabla f(x)\|_2^2 + C \tag{14}$$

for some $A, B, C \ge 0$ and for all $x \in \mathbb{R}^n$.


This assumption generalizes the notion of expected smoothness introduced and 
adjusted for convex problems in [117]. Moreover, the following assumptions are 
stronger than Assumption 4.2 or can be seen as special cases of Assumption 4.2 
(see more details and formal proofs in [142]). 


Uniformly Upper-Bounded Variance (UV) Assumption Indeed, if A = 0, B =
1, and $C = \sigma^2$, then, using the variance decomposition, inequality (14) implies (11):

$$\mathbb{E}\left[\|g(x) - \nabla f(x)\|_2^2\right] = \mathbb{E}\left[\|g(x)\|_2^2\right] - \|\nabla f(x)\|_2^2 \overset{(14)}{\le} \sigma^2.$$


Expected Strong Growth Condition (E-SG) When A = C = 0 and $B = \alpha \ge 1$,
inequality (14) transforms into the so-called expected strong growth condition
[232, 249]:

$$\mathbb{E}\left[\|g(x)\|_2^2\right] \le \alpha\|\nabla f(x)\|_2^2. \tag{15}$$


Maximal Strong Growth Condition (M-SG) This condition [216, 246] states that
there exists $\alpha > 0$ such that

$$\|g(x)\|_2^2 \le \alpha\|\nabla f(x)\|_2^2 \quad \text{almost surely for all } x \in \mathbb{R}^n.$$

This condition implies E-SG (15), while known convergence results in expectation
under the M-SG assumption have no advantage in comparison with their counterparts
under E-SG.


Relaxed Growth Condition (RG) This condition [37] can be seen as another special
case of Assumption 4.2 with A = 0, $B = \alpha \ge 1$, and $C = \beta \ge 0$, or as an extension
of E-SG:

$$\mathbb{E}\left[\|g(x)\|_2^2\right] \le \alpha\|\nabla f(x)\|_2^2 + \beta. \tag{16}$$


However, there exist simple problems of type (4)+(5) that fit the settings we are 
interested in but do not satisfy (16) (see Proposition 1 from [142]). 


Gradient Confusion Condition (GC) This condition [215] was developed for the
finite-sum case (6). In particular, it states that there exists $\eta > 0$ such that for all
$i, j = 1, \ldots, m$ and for all $x \in \mathbb{R}^n$

$$\langle \nabla f_i(x), \nabla f_j(x) \rangle \ge -\eta. \tag{17}$$

One can show (see Theorem 1, [142]) that inequality (17) implies (16) with $\alpha = m$
and $\beta = \eta(m - 1)$, and as a consequence, it is a special case of Assumption 4.2 with
A = 0, B = m, and $C = \eta(m - 1)$.


Sure-Smoothness Condition (SS) This condition [160] is defined for the case when
the objective is represented as an expectation (5) and $g(x) = \nabla f(x, \xi)$, where ξ is
sampled independently at each iteration of SGD. That is, the sure-smoothness condition
means that⁴ for all $x, y \in \mathbb{R}^n$

$$\|\nabla f(x, \xi) - \nabla f(y, \xi)\|_2 \le L\|x - y\|_2 \quad \text{and} \quad f(x, \xi) \ge 0 \quad \text{almost surely in } \xi. \tag{18}$$

Applying classical corollaries of L-smoothness, one can derive inequality (14) with
A = 2L, B = 0, and $C = 2Lf_*$ from (18).

Next, Assumption 4.2 covers the arbitrary sampling setup and the distributed setup
with quantization.⁵ For simplicity, we mention only sampling with replacement as
a special case of arbitrary sampling (see more examples in [142]). In particular,
consider the finite-sum optimization problem (4)+(6) and assume that $f_i$ is $L_i$-smooth
and bounded from below by $f_{i,*}$ for all $i = 1, \ldots, m$. Moreover, assume
that $g(x) = \nabla f_j(x)$, where j = i with probability $p_i > 0$, $i = 1, \ldots, m$,
$\sum_{i=1}^m p_i = 1$. Then, one can prove [142] that Assumption 4.2 is satisfied in this
case with $A = \max_i \frac{L_i}{m p_i}$, B = 0, and $C = 2A\Delta_* = \frac{2A}{m}\sum_{i=1}^m (f_* - f_{i,*})$. That
is, if we apply uniform sampling, i.e., $p_i = \frac{1}{m}$ for all $i = 1, \ldots, m$, then we
get $A = \max_i L_i$, B = 0, $C = 2\max_i L_i \Delta_*$, and if importance sampling with
$p_i = \frac{L_i}{\sum_{l=1}^m L_l}$ is applied, then Assumption 4.2 holds with $A = \bar{L} = \frac{1}{m}\sum_{i=1}^m L_i$,
B = 0, and $C = 2\bar{L}\Delta_*$.

Finally, under Assumption 4.2, Khaled and Richtárik [142] derived the following
complexity bound: if $h = \min\left\{\frac{1}{\sqrt{LAN}}, \frac{1}{LB}, \frac{\varepsilon^2}{2LC}\right\}$, then the inequality

$$\min_{0 \le k \le N-1} \mathbb{E}\left[\|\nabla f(x^k)\|_2^2\right] \le \varepsilon^2 \tag{19}$$

is satisfied after

$$O\left(\frac{L(f(x^0) - f_*)}{\varepsilon^2}\max\left\{B, \frac{A(f(x^0) - f_*)}{\varepsilon^2}, \frac{C}{\varepsilon^2}\right\}\right) \tag{20}$$


⁴ In the original paper [160], the authors considered a more general situation when stochastic
realizations $f(x, \xi)$ have Hölder-continuous gradients.

⁵ This technique is applied in distributed optimization to reduce the overall communication cost
(e.g., see [4, 27, 113]). However, methods for distributed optimization are out of the scope of our
survey.

Table 1 Summary of the complexity results for SGD under different assumptions on the stochastic
gradient. The column "Complexity" contains the overall number of stochastic first-order oracle calls
needed to find an ε-stationary point, neglecting constant factors. Notation: $\Delta_0 = f(x^0) - f_*$, $\sigma^2$ =
a uniform bound for the variance of the stochastic gradient (11), $\alpha, \beta$ = relaxed growth condition
parameters, $\eta$ = gradient confusion parameter, $\Delta_* = \frac{1}{m}\sum_{i=1}^m (f_* - f_{i,*})$, $\max_i L_i$ = maximal
smoothness constant of $f_i$ in (6), $\bar{L}$ = averaged smoothness constant of $f_i$ in (6)

Problem        Settings               Citation     Complexity
(4)+(5)        UV (11)                [102]        $\frac{L\Delta_0}{\varepsilon^2}\max\left\{1, \frac{\sigma^2}{\varepsilon^2}\right\}$
(4)+(5)/(6)    RG (16)                [37, 249]    $\frac{L\Delta_0}{\varepsilon^2}\max\left\{\alpha, \frac{\beta}{\varepsilon^2}\right\}$
(4)+(6)        GC (17)                [215]        $\frac{L\Delta_0}{\varepsilon^2}\max\left\{m, \frac{\eta(m-1)}{\varepsilon^2}\right\}$
(4)+(6)        Uniform Sampling       [142]        $\frac{L\max_i L_i\,\Delta_0}{\varepsilon^4}\max\{\Delta_0, \Delta_*\}$
(4)+(6)        Importance Sampling    [142]        $\frac{L\bar{L}\Delta_0}{\varepsilon^4}\max\{\Delta_0, \Delta_*\}$


iterations of SGD. It is worth mentioning that this bound gives the sharpest rates for
all known special cases. We summarize some of them in Table 1. We notice that (19)
is weaker than (7), but it is easy to obtain the same bound (20) guaranteeing (7)
instead of (19) based on the analysis given in [142].


4.2.2 The Choice of the Stepsize 


In practice, instead of using a constant stepsize for SGD, it is popular to
periodically decrease the stepsize by some factor [35, 128, 153] even for non-convex
problems. For strongly convex problems such a choice is natural: it is well-known
[111] that if the stepsize equals h and the strong convexity parameter equals μ, then
SGD converges with linear rate $O((h\mu)^{-1})$ to a neighborhood of the solution
with size proportional to h. Surprisingly, SGD enjoys similar behavior even for
non-convex problems, which was recently shown in [225].

In the neural networks training, “warmup” [116, 119] and cyclical stepsize 
[171, 231] schedules are also very popular and useful. The first one refers to the 
strategy when, during several epochs of training, tiny stepsizes are used, and then 
they are increased. This technique was successfully applied for several deep learning 
problems such as ResNet [128], large-batch training of ImageNet [119], and natural 
language problems [77, 248]. 

A cyclical stepsize schedule means that the stepsize changes between some
lower and upper bounds. There are different modifications of this technique, including
gradual decrease and increase during one epoch [231] and gradual decrease of
the stepsize followed by a sudden increase [171]. However, the theoretical
understanding of the success of "warmup" and cyclical schedules is very limited.

We also discuss different stepsize policies including adaptive ones (Sect. 4.4),
Armijo line search under expected strong growth assumption, and stochastic
Polyak stepsizes under relaxed growth assumption (Sect. 4.2.3) in the following
subsections.

4.2.3 Over-Parameterized Models


In Sect. 2.3.2, we mentioned that over-parameterization [12, 13, 163, 168, 191, 194,
281], meaning that the last layer has more neurons than the number of samples
in the training set, is a good property for neural networks from the optimization
and generalization [10, 11, 174] perspectives, but not a panacea: over-parameterized
neural networks have no spurious valleys but still can have bad local
minima [79].

In the papers focusing mostly on the optimization aspects of over-parameterized
models, it was shown that SGD converges with the same (up to the difference in
the smoothness constants) rate as GD in terms of the iteration complexity in convex
and strongly convex cases [169, 249, 250] under the interpolation condition: for the
finite-sum optimization problem (4)+(6), there exists a point $x^* \in \mathbb{R}^n$ such that

$$\min_{x \in \mathbb{R}^n} f_i(x) = f_i(x^*) \quad \forall i = 1, \ldots, m. \tag{21}$$

Furthermore, in this setting, SGD converges with Armijo line search [250] and with
stochastic Polyak stepsizes [169]; if, additionally, the expected strong growth
condition (15) holds, SGD can be accelerated [249], and the accelerated version
converges as well as Nesterov's method [184] in terms of iteration complexity up
to the expected strong growth multiplicative factor α from (15).

In the general non-convex case, the following results exist.


Constant Stepsizes In [249], it was shown that SGD with constant stepsize
$h = 1/(\alpha L)$ finds an ε-stationary point under the expected strong growth condition (15) with the
rate $O\left(\alpha L(f(x^0) - f_*)/\varepsilon^2\right)$, matching the iteration complexity of GD up to the factor α.


Armijo Line Search The idea that under the interpolation condition/expected strong
growth condition, SGD and GD have similar properties was then strengthened in
[250], where the authors showed that SGD with Armijo line search converges in
these settings. In particular, the authors of [250] considered stepsizes $h_k$ such that

$$f_{i_k}\left(x^k - h_k\nabla f_{i_k}(x^k)\right) \le f_{i_k}(x^k) - c\,h_k\|\nabla f_{i_k}(x^k)\|_2^2, \tag{22}$$

where the index $i_k$ is sampled uniformly at random from the set $\{1, \ldots, m\}$, the
stochastic gradient $g^k$ is defined as $g^k = \nabla f_{i_k}(x^k)$, and c > 0 is a hyper-parameter.
Moreover, it is assumed that $h_k \in (0, h_{\max}]$ for all $k \ge 0$. Then SGD with
Armijo line search (22) with $c > 1 - L_{\max}/(\alpha L)$ and $h_{\max} < 2/(\alpha L)$ finds an ε-stationary
point under the expected strong growth condition (15) with the rate $O\left((f(x^0) - f_*)/(\delta\varepsilon^2)\right)$,
where

$$\delta = \left(h_{\max} + \frac{2(1 - c)}{L_{\max}}\right) - \alpha\left(h_{\max} - \frac{2(1 - c)}{L_{\max}} + Lh_{\max}^2\right),$$

$L_{\max}$ is the maximal smoothness constant of the summands $f_i$, and L is the smoothness
constant of f. The authors of [250] also considered the version with samples used for
backtracking (22) independent from those used for determining the stochastic gradient, and the
version with non-increasing stepsizes under the additional assumption that the iterates
lie in some ball with radius D. The rates are $O\left(\max\{L_{\max}, \alpha L\}(f(x^0) - f_*)/\varepsilon^2\right)$ and
$O\left(\max\{L_{\max}, \alpha L\}LD^2/\varepsilon^2\right)$, respectively, and both complexity bounds hold with c = 1/2
and $h_{\max} = 1/(\alpha L)$. Finally, in the numerical experiments from [250], the authors
observed that the method's performance is robust to the choice of c and $h_{\max}$.
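A minimal sketch of SGD with a stepsize satisfying condition (22) is given below. The backtracking rule (start from $h_{\max}$ and halve until (22) holds) is a common implementation choice assumed here, since the text only states the condition itself; the scalar quadratic summands sharing one minimizer (so that interpolation holds) and all parameter values are likewise assumptions.

```python
import random

def sgd_armijo(fs, grads, x0, h_max, c, n_iters, seed=0, max_backtracks=50):
    """SGD whose stepsize is found by backtracking on the sampled summand."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_iters):
        i = rng.randrange(len(fs))               # index i_k, uniform
        g = grads[i](x)                          # g^k = grad f_{i_k}(x^k)
        h = h_max
        for _attempt in range(max_backtracks):
            # Armijo condition (22) on the same summand f_{i_k}
            if fs[i](x - h * g) <= fs[i](x) - c * h * g * g:
                break
            h *= 0.5                             # assumed backtracking factor
        x -= h * g
    return x

# Assumed toy problem with interpolation: all f_i share the minimizer x_star.
a = [0.5, 1.0, 1.5, 2.0]
x_star = 3.0
fs = [lambda z, ai=ai: 0.5 * ai * (z - x_star) ** 2 for ai in a]
grads = [lambda z, ai=ai: ai * (z - x_star) for ai in a]
x_out = sgd_armijo(fs, grads, x0=-2.0, h_max=1.0, c=0.5, n_iters=60)
```

No smoothness constant is passed to the routine: the line search adapts the stepsize to the sampled summand, which is the practical appeal of this policy.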


Stochastic Polyak Stepsizes Next, SGD under the expected strong growth condition
converges with stochastic Polyak stepsizes introduced and analyzed in [169]:

$$h_k = \min\left\{\frac{f_{i_k}(x^k) - f_{i_k,*}}{c\|\nabla f_{i_k}(x^k)\|_2^2}, h_b\right\}, \tag{23}$$

where the index $i_k$ is sampled uniformly at random from the set $\{1, \ldots, m\}$, the
stochastic gradient $g^k$ is defined as $g^k = \nabla f_{i_k}(x^k)$, $f_{i,*}$ is a uniform lower bound for
$f_i(x)$, and c > 0 is a hyper-parameter. In particular, one can show [169] that SGD
in these settings with $c > \alpha L/(4L_{\max})$ and $h_b < \max\{2/(\alpha L), \tilde{h}_b\}$ finds an ε-stationary
point under the expected strong growth condition (15) with the rate $O\left((f(x^0) - f_*)/(\delta\varepsilon^2)\right)$,
where $\delta = (h_b + \beta) - \alpha(h_b - \beta + Lh_b^2)$, $\beta = \min\{1/(2cL_{\max}), h_b\}$, and

$$\tilde{h}_b = \frac{-(\alpha - 1) + \sqrt{(\alpha - 1)^2 + 4L\alpha(\alpha + 1)\beta}}{2L\alpha}.$$
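The stepsize rule (23) can be sketched as follows. The toy summands (scalar quadratics with a common minimizer and lower bounds $f_{i,*} = 0$), the values of c and $h_b$, and the guard against a zero stochastic gradient are assumptions of this illustration rather than part of the analysis in [169].

```python
import random

def sgd_polyak(fs, grads, f_lower, x0, c, h_b, n_iters, seed=0):
    """SGD with the capped stochastic Polyak stepsize from (23)."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_iters):
        i = rng.randrange(len(fs))               # index i_k, uniform
        g = grads[i](x)
        if g == 0.0:
            continue                             # sampled summand is already stationary
        h = min((fs[i](x) - f_lower[i]) / (c * g * g), h_b)
        x -= h * g
    return x

# Assumed toy problem: f_i(x) = 0.5 * a_i * (x - x_star)^2 with f_{i,*} = 0.
a = [0.5, 1.0, 1.5, 2.0]
x_star = 3.0
fs = [lambda z, ai=ai: 0.5 * ai * (z - x_star) ** 2 for ai in a]
grads = [lambda z, ai=ai: ai * (z - x_star) for ai in a]
x_out = sgd_polyak(fs, grads, f_lower=[0.0] * 4, x0=-2.0, c=1.0, h_b=10.0, n_iters=60)
```

For these quadratics the uncapped stepsize equals $1/(2a_i)$, so every step halves the distance to the common minimizer.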


4.2.4 Proximal Variants 


In the previous subsections, all complexity results rely on the smoothness of the 
objective function. The natural question arises: is it possible to generalize these 
results to the non-smooth case? In the recent work [151], the authors give a negative
answer to this question for general non-smooth non-convex functions, i.e., one
cannot efficiently find near ε-stationary points via first-order methods. However,
many complexity results that we mentioned before and will mention in the following
subsections have generalizations to composite optimization problems:

$$\min_{x \in \mathbb{R}^n}\{F(x) = f(x) + R(x)\},$$

where the function f is L-smooth but possibly non-convex, while R(x), i.e., the
composite term/regularizer, is a proper closed convex function that can be
non-smooth. Moreover, the function R(x) is often chosen in such a way that the proximal
operator

$$\mathrm{prox}_R(x) = \arg\min_{y \in \mathbb{R}^n}\left\{R(y) + \frac{1}{2}\|y - x\|_2^2\right\}$$

can be easily computed and so that the solution of the problem satisfies certain
properties, e.g., sparsity; see [18, 45, 64] for a detailed discussion and examples
of regularizers.

In this setting, instead of SGD, one can apply prox-SGD defined by the
following recurrence:

$$x^{k+1} = \mathrm{prox}_{h_k R}\left(x^k - h_k g^k\right).$$

Moreover, to measure the progress of the method, the generalized projected
stochastic gradient is used: $\tilde{g}^k = (x^k - x^{k+1})/h_k$. When the regularizer R(x) is a
constant, $\tilde{g}^k = g^k$. For proximal stochastic methods, we say that the iterate $x^k$ is
an ε-stationary point if

$$\mathbb{E}\left[\|\tilde{g}^k\|_2^2\right] \le \varepsilon^2.$$

In [104], it was shown that prox-SGD under the uniformly upper-bounded variance
assumption (11) converges with the rate given in (13). However, the analysis from
[104] works only in the large-batch setting, i.e., when batch sizes are of the order
$O(\varepsilon^{-2})$. For a long time, there was no analysis establishing the same bound without
using $O(\varepsilon^{-2})$ batches, and the problem was recently resolved in [70].
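The prox-SGD recurrence and the generalized gradient $(x^k - x^{k+1})/h_k$ can be illustrated with the regularizer $R(x) = \lambda\|x\|_1$, whose proximal operator is coordinate-wise soft-thresholding. The choice of this regularizer, the quadratic $f(x) = \frac{1}{2}\|x - b\|_2^2$, and the use of the exact gradient in place of a stochastic one (to keep the example deterministic) are assumptions of this sketch.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, applied coordinate-wise."""
    return [max(abs(vi) - tau, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]

def prox_sgd(grad_est, x0, h, lam, n_iters):
    """x^{k+1} = prox_{h R}(x^k - h g^k) with R = lam * ||.||_1."""
    x = list(x0)
    g_tilde = [0.0] * len(x)
    for _ in range(n_iters):
        g = grad_est(x)
        x_new = soft_threshold([xi - h * gi for xi, gi in zip(x, g)], h * lam)
        # generalized projected gradient (x^k - x^{k+1}) / h
        g_tilde = [(xi - yi) / h for xi, yi in zip(x, x_new)]
        x = x_new
    return x, g_tilde

# Assumed smooth part: f(x) = 0.5 * ||x - b||^2, so grad f(x) = x - b.
b = [3.0, -0.2, 0.5, -4.0]
lam = 1.0
x_out, g_tilde = prox_sgd(
    grad_est=lambda x: [xi - bi for xi, bi in zip(x, b)],
    x0=[0.0] * 4, h=0.5, lam=lam, n_iters=100,
)
expected = soft_threshold(b, lam)   # closed-form minimizer of f + R for this toy f
```

The small coordinates of b are set exactly to zero, which is the sparsity effect mentioned above, and the generalized gradient vanishes at the limit point.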


4.2.5 Momentum-SGD 


As we already mentioned, SGD is optimal among stochastic first-order methods
for finding ε-stationary points under the uniformly bounded variance assumption [15].
However, this does not imply that there is no sense in using different methods for
such problems. In practice, different additional tricks are applied to improve the
convergence of SGD, and, perhaps, the most popular one is momentum [203].

Momentum-SGD/Heavy-Ball SGD can be written in different forms. Usually, it
is written as

$$m^{k+1} = \beta_k m^k + g(x^k), \qquad x^{k+1} = x^k - h_k m^{k+1},$$

where the parameter $\beta_k \in [0, 1)$ is called the momentum parameter. In the convex and
strongly convex cases, this method has some advantages in comparison to SGD
such as better last-iterate convergence guarantees [219, 242, 243] but does not have
an accelerated rate [145]. In the non-convex case, Momentum-SGD has the same
complexity guarantee (13) as SGD under the uniformly bounded variance assumption
[71, 267]. However, in practice, Momentum-SGD often works much better than
SGD especially on computer vision problems [239] and also navigates ravines and
escapes saddle points better than SGD.
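The two-line recurrence of Momentum-SGD translates directly into code. In the sketch below the gradient estimator is the exact gradient of an assumed ill-conditioned quadratic, and the constant values h = 0.05 and β = 0.9 are illustrative choices only.

```python
def momentum_sgd(grad_est, x0, h, beta, n_iters):
    """Heavy-ball form: m^{k+1} = beta * m^k + g(x^k), x^{k+1} = x^k - h * m^{k+1}."""
    x = list(x0)
    m = [0.0] * len(x)
    for _ in range(n_iters):
        g = grad_est(x)
        m = [beta * mi + gi for mi, gi in zip(m, g)]   # momentum buffer update
        x = [xi - h * mi for xi, mi in zip(x, m)]      # step along the buffer
    return x

# Assumed toy objective: f(x) = 0.5 * (x_1^2 + 10 * x_2^2).
curv = [1.0, 10.0]
x_out = momentum_sgd(
    grad_est=lambda x: [ci * xi for ci, xi in zip(curv, x)],
    x0=[1.0, 1.0], h=0.05, beta=0.9, n_iters=500,
)
```

Setting `beta=0.0` recovers plain (S)GD, which makes the two methods easy to compare on the same problem.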

Among other works on Momentum-SGD, we emphasize the recent paper [71]
establishing the tight convergence rates for Momentum-SGD in Stochastic Primal
Averaging [242] form via Lyapunov functions analysis. In particular, [71] justifies
(theoretically and/or empirically) the following important insights about the
behavior of Momentum-SGD: (i) Momentum-SGD is provably better than SGD during
the early stage of the convergence, (ii) it is better to gradually reduce the momentum
parameter $\beta_k$ rather than the stepsize $h_k$, and (iii) gradual changes of the parameters
of Momentum-SGD are preferable to sudden changes.


4.2.6 Random Reshuffling 


Before this subsection, we always assumed that stochastic gradients are sampled
independently from previous iterations. However, in the context of finite-sum
optimization (4)+(6), a different sampling strategy called random reshuffling (or
SGD without replacement sampling) is often used: at each epoch (pass
through the dataset), a random permutation $\{i_1, i_2, \ldots, i_m\}$ of the set $\{1, 2, \ldots, m\}$
is generated, defining the order of gradient computations (see Algorithm 4). This
strategy implies that the stochastic gradient in RR is biased.


Algorithm 4 Random Reshuffling (RR)

Require: learning rates $\{h_{s,k}\}_{s,k \ge 0}$, starting point $x^0 \in \mathbb{R}^n$, batch size $r \ge 1$, number of epochs S
Set $x_0^0 = x^0$
for s = 0, 1, 2, ..., S − 1 do
  Generate random permutation $\{i_{s,1}, \ldots, i_{s,m}\}$ of the set $\{1, \ldots, m\}$
  Set $l = \lceil m/r \rceil$
  for k = 0, 1, ..., l − 1 do
    Set $r_k = \min\{r, m - kr\}$
    Compute $g_s^k = \frac{1}{r_k}\sum_{j=1}^{r_k} \nabla f_{i_{s,kr+j}}(x_s^k)$
    $x_s^{k+1} = x_s^k - h_{s,k}\,g_s^k$
  end for
  $x_{s+1}^0 = x_s^l$
end for
return $x_{S-1}^l$
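Algorithm 4 can be sketched as follows. A constant stepsize is used instead of a general schedule $h_{s,k}$, and the toy summands (scalar quadratics with a common minimizer) are chosen only so that the result is easy to check; both are assumptions of this illustration.

```python
import random

def random_reshuffling(grads, x0, h, r, n_epochs, seed=0):
    """Random Reshuffling: one fresh permutation per epoch, consecutive mini-batches."""
    rng = random.Random(seed)
    m = len(grads)
    x = x0
    for _ in range(n_epochs):
        perm = list(range(m))
        rng.shuffle(perm)                        # permutation for this epoch
        for start in range(0, m, r):             # ceil(m / r) mini-batches
            batch = perm[start:start + r]        # the last batch may be smaller than r
            g = sum(grads[i](x) for i in batch) / len(batch)
            x -= h * g
    return x

# Assumed toy problem: f_i(x) = 0.5 * a_i * (x - x_star)^2.
a = [0.5 + 0.1 * i for i in range(10)]
x_star = 3.0
grads = [lambda z, ai=ai: ai * (z - x_star) for ai in a]
x_out = random_reshuffling(grads, x0=-2.0, h=0.5, r=3, n_epochs=20)
```

Each summand is visited exactly once per epoch, in contrast to sampling with replacement, where some indices may be skipped and others repeated.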


While the superiority of RR to SGD was empirically discovered a long time
ago [34, 36], the theoretical justification of this phenomenon was developed only
recently [126, 180, 195, 206]. In particular, the authors of [195] proved that RR
under the uniformly bounded gradients assumption,

$$\|\nabla f_i(x)\|_2 \le G \quad \forall i = 1, \ldots, m, \quad \forall x \in \mathbb{R}^n,$$

finds an ε-stationary point with the rate $O\left(L_{\max}m(f(x^0) - f_*)\left(\varepsilon^{-2} + G\varepsilon^{-3}\right)\right)$,
where $L_{\max}$ is the maximal smoothness constant of the summands $f_1, \ldots, f_m$. Then,
in [180], this result was generalized and tightened: under the assumption

$$\frac{1}{m}\sum_{i=1}^m \|\nabla f_i(x) - \nabla f(x)\|_2^2 \le 2A(f(x) - f_*) + C, \tag{24}$$

which is a special case of (14) with B = 1, the authors of [180] derived the following
bound:

$$O\left(\frac{L_{\max}\sqrt{m}\,(f(x^0) - f_*)}{\varepsilon^2}\left(\sqrt{m} + \frac{\sqrt{A(f(x^0) - f_*)} + \sqrt{C}}{\varepsilon}\right)\right). \tag{25}$$

That is, under the uniformly bounded variance assumption (11), this bound transforms
(A = 0, $C = \sigma^2$) into $O\left(L_{\max}\sqrt{m}\,(f(x^0) - f_*)\left(\sqrt{m}\,\varepsilon^{-2} + \sigma\varepsilon^{-3}\right)\right)$, which
outperforms the corresponding complexity bound for SGD (13) whenever $L_{\max}\sqrt{m}\,\varepsilon \le L\sigma$.
Next, one can show that for $L_{\max}$-smooth $f_i$ uniformly lower bounded by $f_{i,*}$,
$i = 1, \ldots, m$, (24) holds with $A = L_{\max}$ and $C = 2L_{\max}\Delta_* = \frac{2L_{\max}}{m}\sum_{i=1}^m (f_* - f_{i,*})$,
and as a consequence of (25), RR converges with the rate

$$O\left(\frac{L_{\max}\sqrt{m}\,(f(x^0) - f_*)}{\varepsilon^2}\left(\sqrt{m} + \frac{\sqrt{L_{\max}(f(x^0) - f_*)} + \sqrt{L_{\max}\Delta_*}}{\varepsilon}\right)\right),$$

which is better than the corresponding bound for SGD (see Table 1) when

$$L\sqrt{f(x^0) - f_*} \ge \varepsilon\sqrt{L_{\max}m} \quad \text{and} \quad L\sqrt{\Delta_*} \ge \varepsilon\sqrt{L_{\max}m}.$$


4.3 Variance-Reduced Methods 


In this section, we discuss variance reduction for non-convex optimization—a 
special technique aimed at improving the convergence speed of SGD for finite-sum 
optimization problems (4)+(6). The typical behavior of SGD with constant stepsize
h and batch size r < m is as follows: during the first iterations, the method converges
rapidly to some neighborhood of the solution or local minimum, and then it starts
to oscillate in this neighborhood. Such oscillations of SGD are common even for
strongly convex problems, meaning that it is not a drawback of the problem. The
size of the oscillation region is proportional to $h\sigma^2/r$, and this fact hints at two simple
and famous remedies: decreasing (gradually or suddenly) or small stepsizes and
large enough batch sizes. However, the first option can make the convergence too
slow, and the second option dramatically increases the iteration cost.

To remove these drawbacks, one can apply variance-reduced methods such as
SAG [217], SAGA [73], SVRG [140], Finito [74], MISO [175]. In particular, all of
the mentioned methods have an $O\left((m + L/\mu)\ln(1/\varepsilon)\right)$ convergence rate in the μ-strongly
convex case. What is more, they use a constant stepsize, and at each iteration (besides
each m-th iteration or besides the first one), they require one computation of the
stochastic gradient with batch size r = 1 in the strongly convex case.

Among variance-reduced methods, SAGA and SVRG are the most popular ones
(see Algorithms 5 and 6).

Algorithm 5 SAGA [73, 208]

Require: learning rate $h > 0$, starting point $x^0 \in \mathbb{R}^n$, batch size $r \ge 1$
Set $\phi_j^0 = x^0$ for each $j \in [m]$
$v^0 = \frac{1}{m}\sum_{i=1}^{m}\nabla f_i(\phi_i^0)$
for $k = 0, 1, 2, \ldots$ do
    Uniformly randomly pick sets $I_k$, $J_k$ from $\{1, 2, \ldots, m\}$ (with replacement) such that $|I_k| = |J_k| = r$
    $g^k = \frac{1}{r}\sum_{i \in I_k}\left(\nabla f_i(x^k) - \nabla f_i(\phi_i^k)\right) + v^k$
    $x^{k+1} = x^k - h g^k$
    $\phi_j^{k+1} = x^k$ for $j \in J_k$ and $\phi_j^{k+1} = \phi_j^k$ for $j \notin J_k$
    $v^{k+1} = v^k - \frac{1}{m}\sum_{j \in J_k}\left(\nabla f_j(\phi_j^k) - \nabla f_j(\phi_j^{k+1})\right)$
end for


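Algorithm 6 SVRG [140, 208]

Require: learning rate $h > 0$, epoch length $T$, starting point $x^0 \in \mathbb{R}^n$, batch size $r \ge 1$
$\tilde{x}_0 = x_0^0 = x^0$
for $s = 0, 1, 2, \ldots$ do
    for $k = 0, 1, 2, \ldots, T - 1$ do
        Uniformly randomly pick set $I_k$ from $\{1, \ldots, m\}$ (with replacement) such that $|I_k| = r$
        $g^k = \frac{1}{r}\sum_{i \in I_k}\left(\nabla f_i(x_s^k) - \nabla f_i(\tilde{x}_s)\right) + \nabla f(\tilde{x}_s)$
        $x_s^{k+1} = x_s^k - h g^k$
    end for
    $\tilde{x}_{s+1} = x_{s+1}^0 = x_s^T$
end for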
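
The SVRG update can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation analyzed in [140, 208]: the toy problem (scalar least squares, which is convex), the function name `svrg`, and all parameter values are assumptions chosen so that the loop is easy to follow and check.

```python
import random


def svrg(grads, x0, h, T, epochs, r, seed=0):
    """Mini-batch SVRG for a scalar finite-sum problem.

    grads[i](x) must return the gradient of the i-th summand f_i at x.
    """
    rng = random.Random(seed)
    m = len(grads)
    x_ref = x0  # reference point (x tilde in Algorithm 6)
    x = x0
    for _ in range(epochs):
        # Full gradient at the reference point, computed once per epoch.
        full_grad = sum(g(x_ref) for g in grads) / m
        for _ in range(T):
            # Sample a mini-batch of indices with replacement.
            batch = [rng.randrange(m) for _ in range(r)]
            # Variance-reduced gradient estimator.
            g_k = sum(grads[i](x) - grads[i](x_ref) for i in batch) / r + full_grad
            x = x - h * g_k
        x_ref = x  # the last inner iterate becomes the new reference point
    return x


# Toy problem (an assumption for illustration): f_i(x) = 0.5 * (a_i * x - b_i)^2.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 3.0, 7.0, 9.0]
grads = [lambda x, ai=ai, bi=bi: ai * (ai * x - bi) for ai, bi in zip(a, b)]

x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
x_hat = svrg(grads, x0=0.0, h=0.02, T=20, epochs=30, r=1)
```

Unlike plain SGD with a constant stepsize, the estimator's variance vanishes as both the iterate and the reference point approach the minimizer, so the iterates do not oscillate in a neighborhood of the solution.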


In previous subsections, we already mentioned that to find an $\varepsilon$-stationary point,
GD and SGD require$^6$ $O(m\varepsilon^{-2})$ and $O(\varepsilon^{-4})$ calculations of the gradients of the
summands, respectively. Although SAGA and SVRG were initially analyzed only
in strongly convex cases, their convergence in the non-convex case is now also
well known due to [207, 208]. Unfortunately, when $r = 1$, both SAGA and SVRG
guarantee only the $O(m\varepsilon^{-2})$ convergence rate, the same as plain GD. However, if
$r = m^{2/3}$, then SAGA and SVRG converge with the rate $O(m^{2/3}\varepsilon^{-2})$, which has
$m^{1/3}$ times better dependence on $m$ than the complexity bound for GD.

However, the lower bound is $\Omega(\sqrt{m}\,\varepsilon^{-2})$ [91, 164], and there exist optimal
algorithms. Essentially, these methods are variations of SARAH [193]. However,
in the original paper on SARAH for non-convex problems, the authors did not
prove complexity bounds for the finite-sum optimization problems. After that, in
[91], the authors proposed the first lower bounds in the small data regime
$m = O(L^2(f(x^0) - f_*)^2\varepsilon^{-4})$ together with the first optimal method called SPIDER.
Despite the theoretical optimality of the method, it requires very small stepsize 


6 For simplicity, we neglect all parameters except $m$ and $\varepsilon$, see the details in Table 2.
(proportional to the target accuracy $\varepsilon$) that leads to poor behavior in practice. Moreover, the
original proof of the convergence rate for SPIDER is technically involved, which
makes it hard to generalize the method to composite optimization problems.
In recent works [253, 254], a much simpler optimal method called SpiderBoost
was proposed (see Algorithm 7). Moreover, this method works with large constant
stepsizes (of order $L^{-1}$), can be easily generalized to composite optimization
problems, and works well with heavy-ball momentum.


Algorithm 7 SpiderBoost [253, 254]

Require: learning rate $h > 0$, epoch length $T$, starting point $x^0 \in \mathbb{R}^n$, batch size $r \ge 1$, number
of iterations $K$
for $k = 0, 1, \ldots, K - 1$ do
    if $k \bmod T = 0$ then
        Compute $g^k = \nabla f(x^k)$
    else
        Uniformly randomly pick set $I_k$ from $\{1, \ldots, m\}$ (with replacement) such that $|I_k| = r$
        Compute $g^k = \frac{1}{r}\sum_{i \in I_k}\left(\nabla f_i(x^k) - \nabla f_i(x^{k-1})\right) + g^{k-1}$
    end if
    $x^{k+1} = x^k - h g^k$
end for
Pick $\xi$ uniformly at random from $\{0, \ldots, K - 1\}$
return $x^{\xi}$


Next, in [164], the same lower bound $\Omega(\sqrt{m}\,\varepsilon^{-2})$ was derived without any
assumptions on $m$. Furthermore, the authors of [164] proposed a new optimal
method called PAGE (see Algorithm 8), which is a variant of SPIDER with random
length of the inner loop, making the method easier to analyze.


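Algorithm 8 ProbAbilistic Gradient Estimator (PAGE) Algorithm [164]

Require: initial point $x^0$, stepsize $h$, mini-batch sizes $r$, $r' < r$, probabilities $\{p_k\}_{k \ge 0} \subset (0, 1]$ of
large-batch stochastic gradient computation, number of iterations $K$
$g^0 = \frac{1}{r}\sum_{i \in I_0}\nabla f_i(x^0)$, where $I_0$ denotes indices in the mini-batch, $|I_0| = r$
for $k = 0, 1, 2, \ldots, K - 1$ do
    $x^{k+1} = x^k - h g^k$
    $g^{k+1} = \begin{cases} \frac{1}{r}\sum_{i \in I_k}\nabla f_i(x^{k+1}) & \text{with probability } p_k, \\ g^k + \frac{1}{r'}\sum_{i \in I_k'}\left(\nabla f_i(x^{k+1}) - \nabla f_i(x^k)\right) & \text{with probability } 1 - p_k, \end{cases}$ where $|I_k| = r$, $|I_k'| = r'$
end for
return $\hat{x}^K$ chosen uniformly from $\{x^k\}_{k \in [K]}$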
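
The probabilistic switch between a large-batch gradient and a cheap recursive correction can be sketched as follows. This is a hedged illustration only: the toy scalar least-squares problem, the choice of the full sum as the "large batch", the constant probability `p`, and the function name `page` are assumptions made here, and the sketch returns all iterates instead of sampling one of them uniformly as Algorithm 8 does.

```python
import random


def page(grads, x0, h, r_small, p, K, seed=0):
    """PAGE-style gradient estimator on a scalar finite-sum problem.

    grads[i](x) returns the gradient of f_i at x. The large batch is taken to be
    the full sum (r = m), which is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    m = len(grads)

    def full_grad(x):
        return sum(g(x) for g in grads) / m

    x = x0
    g = full_grad(x)
    iterates = [x]
    for _ in range(K):
        x_new = x - h * g
        if rng.random() < p:
            # With probability p: recompute the large-batch gradient.
            g = full_grad(x_new)
        else:
            # With probability 1 - p: reuse g and add a small-batch correction.
            batch = [rng.randrange(m) for _ in range(r_small)]
            g = g + sum(grads[i](x_new) - grads[i](x) for i in batch) / r_small
        x = x_new
        iterates.append(x)
    return iterates


# Toy problem (an assumption for illustration): f_i(x) = 0.5 * (a_i * x - b_i)^2.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 3.0, 7.0, 9.0]
grads = [lambda x, ai=ai, bi=bi: ai * (ai * x - bi) for ai, bi in zip(a, b)]

x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
iterates = page(grads, x0=0.0, h=0.01, r_small=1, p=0.2, K=1000)
```

Compared with SVRG or SpiderBoost, there is no explicit inner loop: the length of a run of cheap updates is random, which is the feature the text credits with simplifying the analysis.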


However, in deep neural network training, variance-reduced methods typically
work worse than SGD or SGD with momentum [72]. This often happens due
to the bad behavior of variance-reduced methods with several tricks widespread in
deep learning, such as batch normalization, data augmentation, and dropout (see
the details in [72]). Moreover, if the model is over-parameterized or, in particular,
the expected strong growth condition (15) or its relaxed version (16) with small noise
level holds, SGD is as fast as GD in terms of iteration complexity, meaning that
variance reduction is superfluous. That is, for over-parameterized models, the
variance reduction trick is often not needed or gives worse rates than SGD
from both theoretical and practical perspectives. Nevertheless, when the problem is not
over-parameterized, it makes sense to use variance-reduced methods.

We summarize the above-discussed complexity bounds in Table 2. We also want
to mention some papers that are not presented in Table 2 but are highly relevant. In
[165], a generalization of the approach from [142] was developed, providing
a unified analysis of different variants of SGD, non-optimal variance-reduced
methods such as SAGA or L-SVRG [130, 152], and some distributed methods
with quantization [4] including DIANA-type variance reduction [131, 179] for
non-convex optimization. Next, for the online case (4)+(5) with smooth stochastic
trajectories, the optimal rate $O(\varepsilon^{-3})$ was shown for the STOchastic Recursive
Momentum (STORM) method [68], which does not require periodic large-batch stochastic
gradient computations and is more robust to the selection of parameters, and for its
proximal variant [262]. These results shed light on the role of momentum in
stochastic first-order methods. Finally, it is possible to generalize SPIDER and get
similar rates for composition optimization problems [59, 273].


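Table 2 Overview of the complexity results for different variance-reduced methods applied to
solve problem (4)+(6) with $L$-smooth summands. The column "Complexity" contains an overall
number of stochastic first-order oracle calls needed to find an $\varepsilon$-stationary point neglecting constant
factors. Notation: $\Delta_0 = f(x^0) - f_*$, $\Delta_* = \frac{1}{m}\sum_{i=1}^{m}(f_* - f_{i,*})$, $\sigma^2$ = a uniform bound for the
variance of the stochastic gradient (11) (can be $\infty$ for variance-reduced methods), $r$ = batch size

| Method | Citation | Complexity |
|---|---|---|
| Lower bound | [91, 164] | $L\Delta_0\min\{\sigma\varepsilon^{-3}, \sqrt{m}\,\varepsilon^{-2}\}$ |
| GD | | $mL\Delta_0\varepsilon^{-2}$ |
| SGD, bounded var. | [102] | $L\Delta_0\max\{\varepsilon^{-2}, \sigma^2\varepsilon^{-4}\}$ |
| SGD, unbounded var. | [142] | $L^2\Delta_0\varepsilon^{-4}\max\{\Delta_0, \Delta_*\}$ |
| SVRG, $r = 1$ | [208] | $mL\Delta_0\varepsilon^{-2}$ |
| SVRG, $r = \lceil m^{2/3}\rceil$ | [208] | $m^{2/3}L\Delta_0\varepsilon^{-2}$ |
| SAGA, $r = 1$ | [208] | $mL\Delta_0\varepsilon^{-2}$ |
| SAGA, $r = \lceil m^{2/3}\rceil$ | [208] | $m^{2/3}L\Delta_0\varepsilon^{-2}$ |
| SpiderBoost | [253, 254] | $m^{1/2}L\Delta_0\varepsilon^{-2}$ |
| SpiderBoost-M | [254] | $m^{1/2}L\Delta_0\varepsilon^{-2}$ |
| SPIDER | [91] | $L\Delta_0\min\{\sigma\varepsilon^{-3}, \sqrt{m}\,\varepsilon^{-2}\}$ |
| PAGE | [164] | $L\Delta_0\min\{\sigma\varepsilon^{-3}, \sqrt{m}\,\varepsilon^{-2}\}$ |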
4.3.1 Convex and Weakly Convex Sums of Non-Convex Functions


There are also several results devoted to the case when the objective function $f$
from (6) is (strongly) convex or almost convex, while the summands $f_i$ are smooth
but can be non-convex. In particular, [284] establishes lower bounds for the cases
when (i) $f$ is $\mu$-strongly convex with $\mu > 0$, (ii) $f$ is $\alpha$-weakly convex, i.e.,

$$f(x) - f(y) - \langle\nabla f(y), x - y\rangle \ge -\frac{\alpha}{2}\|x - y\|_2^2,$$

and (iii) $f_i$ are $\alpha$-weakly convex. Due to the additional assumptions on the structure
of non-convexity in the problem, the proposed lower bounds are tighter in these
situations than the lower bound from [91, 164]. The lower bounds for the case (i)
were further tightened in [261]. Moreover, there exist optimal and almost optimal
methods for each case, see Table 3 for the details.


4.4 Adaptive Methods 


One of the most significant issues of the methods described above is that they 
require tuning of the stepsize and other parameters (e.g., batch size) when used 
in practice. It is often challenging and takes a lot of time, especially for training 


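Table 3 Overview of the optimal convergence results for convex and weakly convex sums of
non-convex functions. Averaged $L$-smoothness of $\{f_i\}_{i=1}^m$ means that for all $x, y \in \mathbb{R}^n$ the following
inequality holds: $\frac{1}{m}\sum_{i=1}^{m}\|\nabla f_i(x) - \nabla f_i(y)\|_2^2 \le L^2\|x - y\|_2^2$. The column "Lower bound" gives
the number of stochastic first-order oracle calls needed to find such $\hat{x}$ that $\mathbb{E}[f(\hat{x}) - f(x^*)] \le \varepsilon$
for the second and the third rows and an $\varepsilon$-stationary point for the fourth and the fifth rows. Notation:
$R_0$ = distance from $x^0$ to the solutions set (for the third row), $\Delta_0 = f(x^0) - f_*$

| Settings | Lower bound | Upper bound, methods |
|---|---|---|
| $f$ is $\mu$-str. cvx. and $L$-smooth, $\{f_i\}_{i=1}^m$ are average $L$-smooth | $\left(m + m^{3/4}\sqrt{L/\mu}\right)\log\frac{\Delta_0}{\varepsilon}$ [261] | $\left(m + m^{3/4}\sqrt{L/\mu}\right)\log\frac{\Delta_0}{\varepsilon}$, Dual-Free SDCA [221], Katyusha X [7] |
| $f$ is cvx. and $L$-smooth, $\{f_i\}_{i=1}^m$ are average $L$-smooth | $m + m^{3/4}R_0\sqrt{L/\varepsilon}$ [284] | $m + m^{3/4}R_0\sqrt{L/\varepsilon}$, Dual-Free SDCA [221], Katyusha X [7] |
| $f$ is $\alpha$-weakly cvx. and $L$-smooth, $\{f_i\}_{i=1}^m$ are average $L$-smooth | $\frac{\Delta_0}{\varepsilon^2}\min\{m^{3/4}\sqrt{\alpha L}, \sqrt{m}L\}$ [284] | $\frac{\Delta_0}{\varepsilon^2}\min\{m^{3/4}\sqrt{\alpha L}, \sqrt{m}L\}$, RepeatSVRG [1, 49], SPIDER [91], SNVRG [286] |
| $\{f_i\}_{i=1}^m$ are $\alpha$-weakly cvx. and $L$-smooth | $\frac{\Delta_0}{\varepsilon^2}\min\{\sqrt{m\alpha L}, L\}$ [284] | $\frac{\Delta_0}{\varepsilon^2}\min\{\sqrt{m\alpha L}, \sqrt{m}L\}$, Natasha [5], RapGrad [157], Stagewise Katyusha [58] |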
deep neural networks. That is why, in the recent few years, adaptive methods
gained a lot of attention. Below we discuss the most popular ones, AdaGrad and
Adam, as well as their variants. In fact, all of these methods depend on some
parameters, but these algorithms are much more robust to their choice than other
variants of SGD or variance-reduced methods. Therefore, they are often called
adaptive. One can find PyTorch implementations of many popular adaptive first-order
methods together with visualization of their convergence on Rosenbrock and
Rastrigin functions in [63].


4.4.1 AdaGrad and Adam 


AdaGrad As we mentioned above, SGD requires the tuning of the stepsize. The
first algorithm aiming to remove this drawback of SGD was AdaGrad [80]:

$$x_i^{k+1} = x_i^k - \frac{h}{\sqrt{G_i^k + \delta}}\, g_i^k,$$

where the subscript $i$ denotes the $i$-th component of the vector, $G_i^k = \sum_{t=0}^{k}(g_i^t)^2$,
and $\delta$ is some small positive number preventing division by zero and
typically taken of the order $10^{-8}$. AdaGrad can be considered as a special case
of SGD with different per-coordinate stepsizes.

The main advantage of AdaGrad is its robustness to the choice of $h$: in practice,
it often works well with the default value $h = 10^{-2}$. Moreover, AdaGrad was shown
to work well with sparse data [81]. However, in dense settings AdaGrad stepsizes
rapidly decrease, which leads to slow convergence of the method [258].


Adam To resolve this issue of AdaGrad, one can use exponential moving averages
instead of the sums $G_i^k$, leading to the method called RMSprop [244]. Then, based on
RMSprop, the authors of [147] proposed one of the most popular methods in deep
learning, Adam:$^7$

$$m_i^k = \beta_1 m_i^{k-1} + (1 - \beta_1) g_i^k, \qquad \hat{m}_i^k = \frac{m_i^k}{1 - (\beta_1)^{(k)}},$$

$$v_i^k = \beta_2 v_i^{k-1} + (1 - \beta_2)(g_i^k)^2, \qquad \hat{v}_i^k = \frac{v_i^k}{1 - (\beta_2)^{(k)}},$$

$$x_i^{k+1} = x_i^k - \frac{h}{\sqrt{\hat{v}_i^k + \delta}}\,\hat{m}_i^k,$$

7 To distinguish exponents from superindexes, we use braces $(\cdot)$ for exponents.

where $\delta$ is some small positive number preventing division by zero and
typically taken of the order $10^{-8}$. Default values $\beta_1 = 0.9$ and $\beta_2 = 0.999$ from
the original paper [147] often make Adam work well in practice. Adam was initially
analyzed in the online convex case, but then the authors of [209] found a flaw
in the proof for Adam and proposed a convergent variant of Adam called AMSGrad.
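
The two per-coordinate updates can be compared side by side in a short sketch. It is only an illustration under stated assumptions: $\delta$ is placed under the square root (the original Adam paper adds its small constant outside the root), the function names `adagrad_step` and `adam_step` are hypothetical, and the test objective $f(x) = \frac{1}{2}\|x\|_2^2$ is chosen purely for convenience.

```python
import math


def adagrad_step(x, g, G, h=1e-2, delta=1e-8):
    """One AdaGrad step; G holds the running sums of squared gradient components."""
    G = [Gi + gi * gi for Gi, gi in zip(G, g)]
    x = [xi - h * gi / math.sqrt(Gi + delta) for xi, gi, Gi in zip(x, g, G)]
    return x, G


def adam_step(x, g, m, v, k, h=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    """One Adam step at iteration k >= 1 with bias-corrected moment estimates."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** k) for mi in m]  # bias correction of the first moment
    v_hat = [vi / (1 - beta2 ** k) for vi in v]  # bias correction of the second moment
    x = [xi - h * mh / math.sqrt(vh + delta) for xi, mh, vh in zip(x, m_hat, v_hat)]
    return x, m, v


# f(x) = 0.5 * ||x||^2, so the gradient at x is x itself.
x0 = [1.0, -2.0]
x_ada, G = adagrad_step(x0, g=x0, G=[0.0, 0.0], h=0.1)
x_adam, m, v = adam_step(x0, g=x0, m=[0.0, 0.0], v=[0.0, 0.0], k=1, h=0.1)
```

On the very first step both methods move every coordinate by approximately $h$ against the sign of its gradient component, regardless of the gradient's magnitude. The methods differ afterwards: AdaGrad's accumulated sums only grow, while Adam's moving averages can forget old gradients.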


Convergence Guarantees While the superiority of AdaGrad and Adam in comparison
to SGD was noticed in many applications [81, 107, 155], the best-known
complexity bounds for AdaGrad, Adam, and their modifications are the same as or
even worse than the ones for SGD [60, 75, 257, 271, 285]. Furthermore, these
complexity results are derived in the non-convex case under more restrictive
assumptions, e.g., uniformly bounded second moment of the stochastic gradient,
than their counterparts for SGD.
Among other works providing complexity results for Adam and AdaGrad in the
non-convex case, we emphasize [75] because of the generality and the simplicity
of the proofs. Moreover, a unified analysis of proximal variants of AdaGrad and
Adam was proposed in [270]. Furthermore, we emphasize the recent work [226]
where the authors analyze RMSprop without assuming uniform boundedness of the
gradients.

Next, in [277], a theoretical and empirical study of why Adam sometimes behaves
significantly better than SGD was conducted. The authors of [277] empirically
discovered that Adam performs better than SGD when stochastic gradients are
heavy-tailed, and the reason is that Adam does an "adaptive gradient clipping"
[108, 110, 114, 178, 201, 247]. In the same work [277], the authors showed that
in such situations SGD can fail to converge while clipped-SGD (with general
and coordinate-wise clipping operators) provably converges to an $\varepsilon$-stationary point.
Moreover, in [276], it was shown that gradient descent with clipping converges even
under an assumption weaker than $L$-smoothness in the non-convex case with the rate
$\sim \varepsilon^{-2}$, while gradient descent in the same settings can converge arbitrarily slower.
Then, the bound from [276] was improved in [275]. Finally, it is known [108] that
clipped-SGD works better than SGD in the vicinity of extremely steep cliffs. A very
similar approach based on the normalization of gradient descent was also studied in
[127, 161].


4.4.2 Adaptive SGD 


The approach described in Sect. 4.1.2 for the general stochastic optimization
problem (4) with the objective given as (5) was recently extended in [82] to obtain
adaptive methods with Armijo-type line search for stochastic non-convex
optimization. To do that, the authors consider Algorithm 3 with the mini-batch stochastic
gradient (10) and mini-batch size $r = \max\{1, 8\sigma_0^2/\varepsilon^2\}$, where $\sigma_0 > \sigma$. In each
iteration $k$ of Algorithm 3, the stepsize is taken as $h_k = 1/L_k := 1/(2^{i_k - 1}L_{k-1})$ by
increasing $i_k \ge 0$ until the inequality

$$f(x^{k+1}) \le f(x^k) + \left\langle\frac{1}{r}\sum_{l=1}^{r}\nabla f(x^k, \xi_l),\, x^{k+1} - x^k\right\rangle + L_k\|x^{k+1} - x^k\|_2^2 + \frac{\varepsilon^2}{32L_k}$$

is satisfied. This inequality is an inexact upper quadratic bound that follows,
for sufficiently large $L_k$, from the $L$-smoothness and bounded variance. Thus,
$L_k$ plays the role of a guess of the Lipschitz constant $L$ locally between the
points $x^k$ and $x^{k+1}$. The authors of [82] also propose methods for convex
problems based on the same idea, with the difference that in the convex case the
mini-batch size $r$ depends on the iteration counter $k$. A careful choice of this
dependence allows one to adaptively choose both the stepsize $h_k$ and
the mini-batch size $r_k$ simultaneously. These methods have the same, up to logarithmic
factors, iteration complexity and total number of stochastic oracle calls as their
non-adaptive counterparts. In particular, for the non-convex case, the iteration
complexity to obtain an $\varepsilon$-stationary point is $O\left(L(f(x^0) - f_*)/\varepsilon^2\right)$, and the oracle complexity is
$O\left(L(f(x^0) - f_*)\max\{1/\varepsilon^2, \sigma^2/\varepsilon^4\}\right)$. Moreover, empirically, the methods designed
for convex problems turned out to be more efficient on non-convex problems than
the method designed for non-convex problems.


5 First-Order Methods Under Additional Assumptions 


In the previous parts of the paper, we focused on general non-convex problems. 
In this section, we consider two subclasses of non-convex objective functions that 
satisfy assumptions weaker than convexity and, at the same time, strong enough to 
obtain good global convergence rates of optimization algorithms. For simplicity, we 
consider an unconstrained optimization problem (1) with $Q = \mathbb{R}^n$.


5.1 Polyak–Łojasiewicz Condition


A function $f(x)$ is said to satisfy the Polyak–Łojasiewicz (PL) condition [170, 202]
(or to be gradient dominated) if for all $x \in \mathbb{R}^n$

$$f(x) - f(x^*) \le \frac{1}{2\mu}\|\nabla f(x)\|_2^2. \qquad (26)$$

This condition implies that any stationary point of $f(x)$ is a global minimum,
although it is not necessarily unique. In particular, this property holds for strongly
convex functions. It was first shown in [202] that if the objective is also $L$-smooth,
then gradient descent linearly converges to a global minimum, i.e.,

$$f(x^k) - f(x^*) \le \exp\left(-\frac{\mu}{L}k\right)\left(f(x^0) - f(x^*)\right).$$

The Polyak–Łojasiewicz condition is naturally satisfied for the problems of solving
nonlinear systems of equations $g(x) = 0$, where $g(x)$ is a vector-valued function.
This problem can be equivalently reformulated as

$$\min_{x \in \mathbb{R}^n}\left\{f(x) = \frac{1}{2}\|g(x)\|_2^2\right\}.$$

Assuming that, for all $x \in \mathbb{R}^n$,

$$\lambda_{\min}\left(J_g(x)J_g(x)^\top\right) \ge \mu > 0,$$

where $J_g(x)$ is the Jacobian matrix of $g(x)$, one can show that

$$\|\nabla f(x)\|_2^2 = \|J_g(x)^\top g(x)\|_2^2 \ge \mu\|g(x)\|_2^2 = 2\mu f(x),$$

which is exactly the Polyak–Łojasiewicz condition since $g(x^*) = 0$. An extensive
survey of first-order optimization methods under this condition, as well as its
relationship with other classes of functions, can be found in [141]. An interesting
example of the emergence of the PL condition in Linear Feedback Control theory was
recently described in [93], and in over-parameterized deep learning in [21].
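
The nonlinear-equations argument can be checked numerically on a one-dimensional example. This is a sketch under assumptions: the function $g(x) = x + \frac{1}{2}\sin x$ is a hypothetical example chosen here (it is not taken from the cited works), for which $g'(x) \ge \frac{1}{2}$ and hence $J_g J_g^\top = g'(x)^2 \ge \mu = \frac{1}{4}$.

```python
import math


def g(x):
    return x + 0.5 * math.sin(x)


def g_prime(x):
    return 1.0 + 0.5 * math.cos(x)


def f(x):
    # Least-squares reformulation f(x) = 0.5 * g(x)^2 with minimum value 0 at x = 0.
    return 0.5 * g(x) ** 2


def f_prime(x):
    return g_prime(x) * g(x)


mu = 0.25  # lower bound on g'(x)^2, since g'(x) >= 0.5 for all x
grid = [-10.0 + 0.01 * i for i in range(2001)]

# PL condition in the form |f'(x)|^2 - 2 * mu * (f(x) - 0) >= 0 on the grid.
pl_gap = min(f_prime(x) ** 2 - 2.0 * mu * f(x) for x in grid)

# f is not convex: the value at the midpoint 8 exceeds the average over 7 and 9.
midpoint_violation = f(8.0) - 0.5 * (f(7.0) + f(9.0))
```

The example shows that the PL inequality can hold globally for a function that is visibly non-convex, which is the point of treating PL functions as a separate subclass.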

Next, consider the convergence of gradient descent under the PL condition when the
gradient is available only with relative accuracy, i.e., when the inexact gradient
$\tilde{\nabla} f(x)$ satisfies

$$\|\tilde{\nabla} f(x) - \nabla f(x)\|_2 \le \alpha\|\nabla f(x)\|_2,$$

where $\alpha \in [0, 1)$. Let the stepsize $h$ in gradient descent

$$x^{k+1} = x^k - h\tilde{\nabla} f(x^k)$$

be computed using the following formula:

$$h = \frac{1}{L}\cdot\frac{1 - \alpha}{(1 + \alpha)^2}.$$

Combining this with the Lipschitz continuity of the gradient, we obtain

$$f(x^{k+1}) \le f(x^k) - \frac{(1 - \alpha)^2}{2L(1 + \alpha)^2}\|\nabla f(x^k)\|_2^2,$$

leading to

$$f(x^N) - f(x^*) \le \left(1 - \frac{\mu}{L}\cdot\frac{(1 - \alpha)^2}{(1 + \alpha)^2}\right)^N\left(f(x^0) - f(x^*)\right).$$

As a result, we achieve a linear convergence rate for the gradient descent under the
PL condition.
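
Both the per-step decrease and the resulting linear rate can be verified numerically. The sketch below makes several assumptions that are not part of the text: the test function $f(x) = x^2 + 3\sin^2 x$ (a commonly cited non-convex PL example with $f^* = 0$), the constants $L = 8$ and $\mu = 1/32$ for it, and a deterministic inexact gradient whose relative error is exactly $\alpha$ with alternating sign.

```python
import math


def f(x):
    return x * x + 3.0 * math.sin(x) ** 2


def grad(x):
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


L = 8.0          # |f''(x)| = |2 + 6 cos(2x)| <= 8
mu = 1.0 / 32.0  # PL constant assumed for this function
alpha = 0.5      # relative accuracy of the gradient

h = (1.0 - alpha) / (L * (1.0 + alpha) ** 2)
decrease = (1.0 - alpha) ** 2 / (2.0 * L * (1.0 + alpha) ** 2)
rate = 1.0 - (mu / L) * (1.0 - alpha) ** 2 / (1.0 + alpha) ** 2

x = 3.0
f0 = f(x)
N = 200
per_step_ok = True
for k in range(N):
    sign = 1.0 if k % 2 == 0 else -1.0
    inexact = (1.0 + sign * alpha) * grad(x)  # relative error is exactly alpha
    x_next = x - h * inexact
    # Guaranteed decrease: f(x_next) <= f(x) - decrease * |f'(x)|^2.
    per_step_ok = per_step_ok and f(x_next) <= f(x) - decrease * grad(x) ** 2 + 1e-12
    x = x_next

f_N = f(x)
bound = rate ** N * f0  # right-hand side of the linear-rate estimate, with f* = 0
```

The bound is typically loose in practice; its value lies in holding for every inexact gradient that satisfies the relative-accuracy condition, including adversarial ones.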

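In the general case, the main ingredient that guarantees global linear convergence
under the PL condition is an estimate like

$$\|\nabla f(x^N)\|_2^2 \le \theta(N)\cdot\left(f(x^0) - f(x^*)\right),$$

where $\theta(N)$ is some decreasing function, see, e.g., (2). We assume that there exists such
$N(\mu)$ that $\theta(N(\mu)) \le \mu$, e.g., for (2) $N(\mu) = 2L/\mu$. In this case, from the PL condition,

$$\|\nabla f(x^{N(\mu)})\|_2^2 \le \frac{1}{2}\|\nabla f(x^0)\|_2^2.$$

By applying restarts, we obtain oracle complexity $\widetilde{O}(N(\mu))$, i.e., $O(N(\mu))$ up to a
factor logarithmic in the target accuracy.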
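
The restart argument can be spelled out in a short derivation. The target accuracy $\varepsilon$ on the squared gradient norm and the restart counter $t$ are symbols introduced here for illustration only.

```latex
\begin{align*}
\|\nabla f(x^{N})\|_2^2
  &\le \theta(N)\,\bigl(f(x^0) - f(x^*)\bigr)
   && \text{(assumed estimate)} \\
  &\le \frac{\theta(N)}{2\mu}\,\|\nabla f(x^0)\|_2^2
   && \text{(PL condition (26) applied at } x^0\text{)} \\
  &\le \frac{1}{2}\,\|\nabla f(x^0)\|_2^2
   && \text{for } N = N(\mu), \text{ since } \theta(N(\mu)) \le \mu .
\end{align*}
% Restarting the method from x^{N(\mu)} halves the squared gradient norm again, so after
% t restarts \|\nabla f\|_2^2 \le 2^{-t}\,\|\nabla f(x^0)\|_2^2. Hence
% t = \lceil \log_2(\|\nabla f(x^0)\|_2^2 / \varepsilon) \rceil restarts, i.e.
% N(\mu)\, t oracle calls in total, suffice to reach \|\nabla f\|_2^2 \le \varepsilon.
```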


5.1.1 Stochastic First-Order Methods Under Polyak–Łojasiewicz
Condition


The majority of the methods described in Sect. 4 are analyzed under the PL condition as
well. That is, one can find the state-of-the-art results for different variants of SGD
and non-accelerated variance-reduced methods such as SVRG and SAGA in [165],
accelerated variance-reduced methods such as PAGE in [164], the tightest known
analysis of random reshuffling under the PL condition in [2], and the convergence
results for SGD in the over-parameterized case with constant, Armijo-type, and
stochastic Polyak's stepsizes in [249], [250], and [169], respectively. The summary
of known complexity results for the stochastic methods under the PL condition is given
in Table 4. We emphasize that the analysis from [118] is derived under the so-called
expected residual (ER) assumption on the stochastic gradient $g(x)$: there exists
such constant $\rho > 0$ that

$$\mathbb{E}\left[\left\|g(x) - g(x^*) - \left(\nabla f(x) - \nabla f(x^*)\right)\right\|_2^2\right] \le 2\rho\left(f(x) - f(x^*)\right). \qquad (27)$$

Moreover, in the analysis of random reshuffling from [2], it is used that the norms of
the gradients of individual functions from the sum (6) are uniformly upper bounded by some
constant $G$ on the sublevel set:

$$\|\nabla f_i(x)\|_2 \le G, \quad \forall i = 1,\ldots,m, \quad \forall x \in \mathbb{R}^n : f(x) \le f(x^0). \qquad (28)$$

A rather simple introduction (close to the state-of-the-art results) to SGD with
bias under the PL condition can be found in [3].
Table 4 Summary of the state-of-the-art complexity results for different stochastic first-order
methods under the assumption that $f$ is $L$-smooth and satisfies the PL condition (26). Columns:
"Complexity" is the overall number of stochastic first-order oracle calls needed to find such $\hat{x}$
that $\mathbb{E}[f(\hat{x}) - f(x^*)] \le \varepsilon$ neglecting constant factors; "Assumptions" lists the assumptions used to
derive the corresponding complexity bound in addition to $L$-smoothness of $f$ and the PL condition
(26) for $f$. For the finite-sum case (6), it is additionally assumed that each $f_i$ is $L_i$-smooth,
$i = 1,\ldots,m$. Abbreviations: UV = uniform variance bound assumption (11); RG = relaxed
growth condition (16); Avg. $L_{\text{avg}}$-smth. = averaged $L$-smoothness assumption meaning that there
exists such $L_{\text{avg}}$ that $\mathbb{E}\left[\|\nabla f(x, \xi) - \nabla f(y, \xi)\|_2^2\right] \le L_{\text{avg}}^2\|x - y\|_2^2$ in the online case (5), and
$\mathbb{E}_j\left[\|\nabla f_j(x) - \nabla f_j(y)\|_2^2\right] \le L_{\text{avg}}^2\|x - y\|_2^2$ in the finite-sum case, where $j$ is sampled uniformly at
random from $\{1,\ldots,m\}$; Unif. Sampl. and Imp. Sampl. denote the sampling strategies described
in Sect. 4.2.1. Notation: $\Delta_0 = f(x^0) - f_*$; $\sigma^2$ = a uniform bound for the variance of the stochastic
gradient (11); $\alpha, \beta$ = relaxed growth condition parameters; $\Delta_* = \frac{1}{m}\sum_{i=1}^{m}(f_* - f_{i,*})$; $\max_i L_i$
= maximal smoothness constant of $f_i$ in (6); $\bar{L}$ = averaged smoothness constant of $f_i$ in (6);
$\sigma_*^2 = \mathbb{E}\left[\|g(x^*)\|_2^2\right]$ = the variance of the stochastic gradient at the solution
Problem | Method Citation Complexity Assumptions 
(4) GD [202] £ log “0 
A 
(4)4(5) | SGD [141, 142] | £ Jo (2) + 4s UV (11) 
aL 0 
[142, 249] ae log( =) + We RG (16) 
2 ZL A 
PAGE [164] (s ty 3 wt) (2) UV (11), Avg. 
Layg-smth. 
(4)+(6) | GD [202] m4 log 4 Ao 
A 
SGD [142] f (G - 8) log ( a) &) ES (14) 
a LB 
[142, 249] | 8b tog (42) + RG (16) 
[142] E (es Ei log (2) ! man Tas Unif. sampl. 
[142] E ke log 4o } fake Imp. sampl. 
oe p A fon 
[118] L ( (£41) log (42) + % ER (27) 
dase 7 A 
+ Armijo line | [250] (% + mats “) log ( 2) E-SG (15) 
search 
; v2 
+ Polyak [169] —t log (2) Interpolation (21) 
stepsizes 
mA mL2G? log? (e7!) me 
RR [2] ( =o 4 3 - ) Bounded 
i gradients (28) 
3 
SVRG [207,208] | (m+ "maxi 47 ) jog (2) 
] 
L-SVRG [165, 208] (m 4m rc) log (42) Avg. Layg-smth. 
SAGA 
PAGE [164] (b+ b=) log (42), where UV (11) with 
b saieere a? < +00, Avg. 
= min{ 73+) Layg-smth. 
5.2  Star-Convexity and α-Weak-Quasi-Convexity


A function $f(x)$ is called star-convex if for some global minimizer $x^*$ and for all
$\lambda \in [0, 1]$ and $x \in \mathbb{R}^n$

$$f(\lambda x + (1 - \lambda)x^*) \le \lambda f(x) + (1 - \lambda)f(x^*).$$

While any segment connecting two points on the graph of a convex function lies not
lower than the graph, for star-convex functions this is assumed only for segments
connecting some fixed global minimizer and any other point on the graph. This
condition is considerably weaker than convexity, even for functions of one variable.
For example, the function $|x|(1 - e^{-|x|})$ is a non-convex star-convex function. The
authors of [159] analyze a cutting plane method for minimization of this class of
functions and obtain a complexity bound that is polylogarithmic in $\varepsilon$ and polynomial
in $n$, using only function evaluations. The authors of [122, 190] prove that the same
Algorithm 1 possesses the following convergence rate for star-convex $L$-smooth
functions


$$\min_{k=1,\ldots,K}\|\nabla f(y^k)\|_2^2 \le \frac{64L^2V[x^0](x^*)}{K^3}, \qquad f(x^K) - f(x^*) \le \frac{4LV[x^0](x^*)}{K^2}.$$
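
The claim that $|x|(1 - e^{-|x|})$ is star-convex but not convex can be checked numerically. The grid of test points and the tolerance below are arbitrary choices made for this sketch.

```python
import math


def f(x):
    return abs(x) * (1.0 - math.exp(-abs(x)))


x_star = 0.0  # global minimizer, f(x_star) = 0
points = [-6.0 + 0.25 * i for i in range(49)]
lambdas = [0.1 * j for j in range(11)]

# Star-convexity: f(lam*x + (1-lam)*x_star) <= lam*f(x) + (1-lam)*f(x_star).
star_convex = all(
    f(lam * x + (1.0 - lam) * x_star) <= lam * f(x) + (1.0 - lam) * f(x_star) + 1e-12
    for x in points
    for lam in lambdas
)

# Convexity fails: at the midpoint of 3 and 5 the function lies above the chord.
midpoint_gap = f(4.0) - 0.5 * (f(3.0) + f(5.0))
```

The violation appears for $|x| > 2$, where the second derivative $(2 - |x|)e^{-|x|}$ becomes negative, while every segment that ends at the minimizer still lies above the graph.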


A more general class of functions is the class of $\alpha$-weakly quasi-convex functions
satisfying

$$f(x) - f(x^*) \le \frac{1}{\alpha}\langle\nabla f(x), x - x^*\rangle$$

for some $\alpha \in (0, 1]$ and some global minimizer $x^*$. Continuously differentiable
1-weakly quasi-convex functions are exactly the star-convex functions. The authors
of [121] propose an algorithm with iteration complexity $O(\alpha^{-1}L^{1/2}R\varepsilon^{-1/2})$, where
$R$ is an upper bound on the initial distance to the point $x^*$. A slightly worse
bound $O(\alpha^{-3/2}L^{1/2}R\varepsilon^{-1/2})$ is obtained in [190] by restarting Algorithm 1. Both
approaches require a line search for which the complexity also needs to be
estimated. The authors of [129] analyze this complexity and propose an algorithm with
$O(\alpha^{-1}L^{1/2}R\varepsilon^{-1/2})$ iteration complexity and the same, up to a logarithmic factor
in $\alpha^{-1}\varepsilon^{-1}$, number of function and gradient evaluations. Moreover, they provide a
similar lower complexity bound, thus proving that their method is optimal. Further,
they also consider a class of $(\alpha, \mu)$-strongly quasi-convex functions satisfying

$$f(x) - f(x^*) \le \frac{1}{\alpha}\langle\nabla f(x), x - x^*\rangle - \frac{\mu}{2}\|x - x^*\|_2^2$$

and provide an algorithm that has iteration complexity

$$O\left(\alpha^{-1}L^{1/2}\mu^{-1/2}\log(\alpha^{-1}\varepsilon^{-1})\right)$$

and requires, up to a logarithmic factor, the same number of function and gradient
evaluations. Similar optimal complexity bounds for the accelerated gradient method for
$\alpha$-weakly quasi-convex functions and $(\alpha, \mu)$-strongly quasi-convex functions were
obtained in [39] by extending the estimating sequence technique.


5.2.1 Stochastic Methods and a-Weak-Quasi-Convexity 


The most general analysis of SGD under $\alpha$-weak-quasi-convexity is provided
in [118]. As mentioned earlier, the authors of [118] consider finite-sum
optimization problems$^8$ (4)+(6) and derive complexity bounds for SGD under the
expected residual assumption (27) on the stochastic gradient for $\alpha$-weakly
quasi-convex functions and functions satisfying the PL condition. In particular, for SGD in
these settings, the following bound was established:

$$O\left(\frac{(\rho + L)R_0^2}{\alpha^2\varepsilon} + \frac{\sigma_*^2R_0^2}{\alpha^2\varepsilon^2}\right),$$

where $\sigma_*^2 = \mathbb{E}\left[\|g(x^*)\|_2^2\right]$ is the variance of the stochastic gradient at the
solution. Note that when the interpolation condition (21) holds, this bound reduces
to $O\left((\rho + L)R_0^2/(\alpha^2\varepsilon)\right)$. Moreover, under the interpolation condition, the authors of [118]
also derived that SGD with the generalized version of the stochastic Polyak stepsize (23) for
the stochastically reformulated problem (4)+(6) converges with the rate

$$O\left(\frac{\mathcal{L}R_0^2}{\alpha^2\varepsilon}\right),$$

where $\mathcal{L}$ is the expected smoothness constant of the stochastic reformulation (see the
details in [117, 118]). In the full-batch case, i.e., when $g(x) = \nabla f(x)$, we have
$\mathcal{L} = L$, and in the importance sampling case, i.e., when $g(x) = \nabla f_j(x)$, where
$j = i$ with probability $L_i/\sum_{l=1}^{m}L_l$, we have $\mathcal{L} = \bar{L} = \frac{1}{m}\sum_{i=1}^{m}L_i$.


5.2.2. Further Generalizations 


A wider class of functions that covers the class of $\alpha$-weakly quasi-convex
functions is the class of approximately homogeneous functions, which satisfy the condition

$$N\left(f(x) - f(x^*)\right) \le \langle\partial f(x), x - x^*\rangle \le M\left(f(x) - f(x^*)\right),$$

where $\partial f(x)$ is a subgradient of $f(x)$ and $N$, $M$ are some constants. This class of
functions was first defined in [229] and discussed in [204].

In general, if there exist good lower and upper convex models for a non-convex
target function, one can show that the complexity of such a problem is similar to that of
convex problems rather than non-convex ones (see [20] and references therein).

8 In fact, most of the results from [118] do not rely on the finite-sum structure of $f$.


6 Higher-Order Methods 


6.1 Second-Order Methods 


Another branch of optimization methods for solving (4) consists of methods
that use second-order information about the function. This information is very
helpful for escaping saddle points by using negative curvature. We say that $x^*$ is an
$(\varepsilon, \delta)$-second-order stationary point if

$$\|\nabla f(x^*)\|_2 \le \varepsilon, \qquad \lambda_{\min}\left(\nabla^2 f(x^*)\right) \ge -\delta.$$

In the rest of this section, we suppose that $f(x)$ has an $L_2$-Lipschitz second-order
derivative. The basic method for this class of problems is the Cubic Regularization
method (CR) [188]:

$$x^{k+1} = x^k + \arg\min_{s \in \mathbb{R}^n}\left[\nabla f(x^k)^\top s + \frac{1}{2}s^\top\nabla^2 f(x^k)s + \frac{H}{6}\|s\|_2^3\right], \qquad (29)$$

where $H > 0$. It globally converges to the minimum for convex functions and
converges to an $(\varepsilon, \delta)$-second-order stationary point for non-convex functions within
$O(\varepsilon^{-3/2})$ iterations. Note that the subproblem (29) is also non-convex,
but in [188], the authors proposed a method to solve this problem as a convex
problem via a special choice of $H$ and line search for a dual problem. A related line of
work considers trust region methods [52–55, 65], where a classical Newton step is
calculated on a Euclidean ball of a carefully chosen radius. Both cubic-regularized
Newton methods and trust region methods can be extended to work for constrained
problems with linear and conic constraints [84, 85, 125]. In general, all these
algorithms work well for problems in moderate dimensions. Unfortunately, for
many large-scale machine learning problems, it is hard to calculate the full Hessian
and the inverse of such a large matrix. Recent work has therefore explored the use
of Hessian-vector products $\nabla^2 f(x)\cdot s$, which can be computed as efficiently as
gradients in many cases, including neural networks, by using automatic differentiation.
Using Hessian-vector products, one can efficiently solve the subproblem (29) by variants
of gradient descent [47]. Several algorithms incorporating Hessian-vector products [6, 8] have
been shown to achieve faster convergence rates than gradient descent in the
non-stochastic setting. However, in the stochastic setting where we only have access to
stochastic Hessian-vector products, significantly less progress has been made.
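
Solving the cubic subproblem with gradient descent and Hessian-vector products only can be sketched as follows. This is an illustration under assumptions, not the procedure of [47] or [188]: the stepsize, the iteration count, the coefficient $H/6$ of the cubic term, and the hypothetical indefinite matrix $B = \mathrm{diag}(1, -1)$ are all choices made here.

```python
import math


def solve_cubic_subproblem(g, hvp, H, step=0.05, iters=2000):
    """Gradient descent on m(s) = <g, s> + 0.5 <s, B s> + (H / 6) ||s||^3.

    The matrix B is accessed only through hvp(s) = B s.
    """
    s = [0.0] * len(g)
    for _ in range(iters):
        norm_s = math.sqrt(sum(si * si for si in s))
        Bs = hvp(s)
        # Gradient of the cubic model: g + B s + (H / 2) ||s|| s.
        grad_m = [gi + bi + 0.5 * H * norm_s * si for gi, bi, si in zip(g, Bs, s)]
        s = [si - step * gmi for si, gmi in zip(s, grad_m)]
    return s


def model_value(s, g, hvp, H):
    norm_s = math.sqrt(sum(si * si for si in s))
    Bs = hvp(s)
    return (sum(gi * si for gi, si in zip(g, s))
            + 0.5 * sum(si * bi for si, bi in zip(s, Bs))
            + H / 6.0 * norm_s ** 3)


# Hypothetical indefinite Hessian B = diag(1, -1), available only as a product.
hvp = lambda s: [s[0], -s[1]]
g = [0.2, 0.1]
H = 1.0

s_hat = solve_cubic_subproblem(g, hvp, H)
norm_s_hat = math.sqrt(sum(si * si for si in s_hat))
Bs_hat = hvp(s_hat)
# Norm of the model gradient at the computed step (should be close to zero).
residual = math.sqrt(sum((gi + bi + 0.5 * H * norm_s_hat * si) ** 2
                         for gi, bi, si in zip(g, Bs_hat, s_hat)))
```

In this example the computed step is long along the negative-curvature direction: at a minimizer of the model the matrix $B + \frac{H}{2}\|s\| I$ must be positive semidefinite, which forces $\|s\| \ge 2/H$ here. The matrix $B$ itself is never formed, only its action on a vector is used.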

One of the improvements of this method was proposed in [256]. The authors
introduce a momentum step and obtain a faster convergence rate. This technique is
widely used to speed up first-order methods and can also speed up
second-order methods.


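Algorithm 9 CRm

Input: Initialization $x^0 = y^0 \in \mathbb{R}^n$, $\rho < 1$, $H > L_2$.
for $k = 0, 1, \ldots$ do
    Cubic step:
        $s^{k+1} = \arg\min_{s}\left[\nabla f(x^k)^\top s + \frac{1}{2}s^\top\nabla^2 f(x^k)s + \frac{H}{6}\|s\|_2^3\right]$,
        $y^{k+1} = x^k + s^{k+1}$
    Momentum step:
        $\beta_{k+1} = \min\{\rho, \|\nabla f(y^{k+1})\|_2, \|y^{k+1} - x^k\|_2\}$,
        $z^{k+1} = y^{k+1} + \beta_{k+1}(y^{k+1} - y^k)$
    Monotone step:
        $x^{k+1} = \arg\min_{x \in \{y^{k+1}, z^{k+1}\}} f(x)$
end for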


Also, second-order methods that have access to the Hessian of $f$ can exploit
negative curvature to more effectively escape saddles and arrive at local minima. To
illustrate this concept, we describe one such method [259]. There are two types of
steps: gradient steps and steps along a direction of negative curvature of the Hessian:


• If $\|\nabla f(x^k)\|_2 > \varepsilon$, we do a gradient step.
• Otherwise, if $\lambda_{\min}\left(\nabla^2 f(x^k)\right) < -\delta$, choose $s^k$ to be the eigenvector corresponding
to $\lambda_{\min}\left(\nabla^2 f(x^k)\right)$ and do the step $x^{k+1} = x^k + \alpha_k s^k$.
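
The two-step scheme above can be sketched on a two-dimensional function with a saddle point at the origin. Everything specific in this sketch is an assumption made for illustration: the test function $f(x, y) = \frac{1}{2}x^2 - \frac{1}{2}y^2 + \frac{1}{4}y^4$, the constant stepsizes `h` and `alpha`, the thresholds `eps` and `delta`, and the rule for choosing the sign of the eigenvector.

```python
import math


def f(p):
    x, y = p
    return 0.5 * x * x - 0.5 * y * y + 0.25 * y ** 4


def grad(p):
    x, y = p
    return [x, -y + y ** 3]


def hessian(p):
    # Entries (a, b, c) of the symmetric Hessian [[a, b], [b, c]].
    _, y = p
    return 1.0, 0.0, -1.0 + 3.0 * y * y


def min_eigenpair(a, b, c):
    """Smallest eigenvalue and a unit eigenvector of [[a, b], [b, c]]."""
    lam = 0.5 * (a + c) - math.sqrt(0.25 * (a - c) ** 2 + b * b)
    if b != 0.0:
        v = [lam - c, b]
    else:
        v = [1.0, 0.0] if a <= c else [0.0, 1.0]
    norm = math.hypot(v[0], v[1])
    return lam, [v[0] / norm, v[1] / norm]


def escape_saddle_descent(p, h=0.1, alpha=0.1, eps=1e-3, delta=0.1, max_iter=1000):
    for _ in range(max_iter):
        g = grad(p)
        if math.hypot(g[0], g[1]) > eps:
            p = [p[0] - h * g[0], p[1] - h * g[1]]      # gradient step
            continue
        lam, s = min_eigenpair(*hessian(p))
        if lam >= -delta:
            break                                        # second-order stationary point
        if g[0] * s[0] + g[1] * s[1] > 0.0:
            s = [-s[0], -s[1]]                           # pick the non-ascent direction
        p = [p[0] + alpha * s[0], p[1] + alpha * s[1]]   # negative curvature step
    return p


p_hat = escape_saddle_descent([0.5, 0.0])
lam_hat, _ = min_eigenpair(*hessian(p_hat))
```

Started on the line $y = 0$, plain gradient descent would converge to the saddle at the origin; the single negative curvature step moves the iterate off that line, after which gradient steps reach one of the local minima at $y = \pm 1$.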


There are different policies for choosing $\alpha_k$ and the gradient steps. The main idea here is to use
first-order methods as a cheap main method and switch to expensive second-order
steps when we reach a first-order stationary point and want to escape it to find
a better local minimum. Methods based on this idea are still under development. In [101, 137],
it was proved that gradient methods with additive noise are able to escape from
nondegenerate saddle points and find approximate local minima. These ideas lead
to the state-of-the-art first-order methods for finding local minima with Hessian-vector
products [8, 9, 49, 91, 138, 192, 213, 264]. In recent works [92, 139, 212], it was
proved that stochastic gradient descent can escape from saddle points and converges
to approximate local minima.


6.2 Stochastic Second-Order Methods 


Now we move to the stochastic version of problem (3). First, we consider the online
version (5), where we minimize the expectation of some stochastic function. In
[245], the authors propose a stochastic optimization method that utilizes stochastic
gradients and Hessian-vector products to find an $(\varepsilon, \delta)$-second-order stationary
point using only $O(\varepsilon^{-3.5})$ oracle evaluations. This rate improves upon the $O(\varepsilon^{-4})$
rate of stochastic gradient descent and matches the best-known result for finding
local minima without the need for any delicate acceleration or variance reduction
techniques.


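Algorithm 10 Stochastic cubic regularization

Require: mini-batch sizes $r_1$, $r_2$, initialization $x^0$, number of iterations $N$, and final tolerance $\varepsilon$.
for $k = 0, \ldots, N$ do
    Sample $S_1 \leftarrow \{\xi_i\}_{i=1}^{r_1}$, $S_2 \leftarrow \{\xi_i\}_{i=1}^{r_2}$
    $g^k = \frac{1}{|S_1|}\sum_{\xi_i \in S_1}\nabla f(x^k, \xi_i)$
    $B^k[\cdot] = \frac{1}{|S_2|}\sum_{\xi_i \in S_2}\nabla^2 f(x^k, \xi_i)(\cdot)$
    $s^k = \arg\min_{s}\left[v_k(s) = s^\top g^k + \frac{1}{2}s^\top B^k s + \frac{L_2}{6}\|s\|_2^3\right]$
    $x^{k+1} = x^k + s^k$
end for
Ensure: The final iterate $x^{N+1}$.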

Algorithm 10 is the stochastic cubic regularization method. To obtain
stochastic gradients and Hessians, we can sample independent batches $S_1$ and $S_2$
in each iteration, but they can also be connected so that $S_2 \subseteq S_1$. The average
gradient is denoted by

$$g^k = \frac{1}{|S_1|} \sum_{\xi_i \in S_1} \nabla f(x^k, \xi_i)$$

and the average Hessian by

$$B^k = \frac{1}{|S_2|} \sum_{\xi_i \in S_2} \nabla^2 f(x^k, \xi_i),$$

and this implies a stochastic cubic submodel:




$$v_k(s) = s^T g^k + \frac{1}{2} s^T B^k s + \frac{L_2}{6} \|s\|_2^3.$$


This subproblem should be solved by a special gradient-based subroutine, which is
described in detail in [245]. Since only the gradient of the submodel is used to solve
the subproblem, we need to compute only Hessian-vector products $B^k s$ but not the full
Hessian $B^k$. If our function can be represented by a computational graph, then we can use
automatic differentiation and compute Hessian-vector products as fast as we compute
gradients, up to a small constant factor.
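To make the role of Hessian-vector products concrete, here is a minimal sketch of a gradient-based solver for a cubic submodel of the form $m(s) = s^T g + \frac{1}{2} s^T B s + \frac{M}{6}\|s\|_2^3$. It is plain gradient descent with a fixed step, not the subroutine of [245]; the names `solve_cubic_submodel`, `submodel_value`, `hvp`, the constant `M`, the step size `lr`, and the toy data are assumptions for illustration. The matrix $B$ is accessed only through the callable `hvp(s)`, which returns $B s$.

```python
import math

def solve_cubic_submodel(g, hvp, M, steps=500, lr=0.05):
    """Approximately minimize m(s) = g.s + 0.5 s.B s + (M/6)||s||^3.

    Uses plain gradient descent; B is only accessed via hvp(s) = B s.
    The gradient of the cubic term (M/6)||s||^3 is (M/2)||s|| s.
    """
    n = len(g)
    s = [0.0] * n
    for _ in range(steps):
        Bs = hvp(s)
        norm_s = math.sqrt(sum(si * si for si in s))
        grad_m = [g[i] + Bs[i] + 0.5 * M * norm_s * s[i] for i in range(n)]
        s = [s[i] - lr * grad_m[i] for i in range(n)]
    return s

def submodel_value(s, g, hvp, M):
    """Value of the cubic submodel at s."""
    Bs = hvp(s)
    norm_s = math.sqrt(sum(si * si for si in s))
    linear = sum(gi * si for gi, si in zip(g, s))
    quadratic = 0.5 * sum(si * bi for si, bi in zip(s, Bs))
    return linear + quadratic + M / 6.0 * norm_s ** 3

# Hypothetical toy data: B = diag(1, -1) is indefinite, so the step should
# exploit the negative-curvature direction (second coordinate).
g = [0.1, 0.1]
hvp = lambda s: [s[0], -s[1]]
s_star = solve_cubic_submodel(g, hvp, M=1.0)
value = submodel_value(s_star, g, hvp, M=1.0)
```

In this toy run the second coordinate of `s_star` grows well beyond what a pure gradient step would give, because the cubic term is the only thing that stops the descent along the negative-curvature direction.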

How many Hessians should we take? By concentration inequalities, it is possible
to show that we need

$$|S_2| = r_2 = \tilde{O}(\varepsilon^{-1}).$$

So in total, the method converges in $O(\varepsilon^{-3/2})$ iterations and needs $O(\varepsilon^{-5/2})$
Hessian calculations of the function.

In [16], this approach is improved by using a special variance reduction
technique. The authors obtain a method that needs only $O(\varepsilon^{-3})$ gradients and Hessian-
vector products to find a second-order stationary point. In the same paper, the
authors also prove lower bounds for higher-order stochastic problems.

What is the main advantage of such methods? We calculate fewer Hessians 
than in the full CR version and also do it in parallel if we have many cores for 
computing. The simplicity of the algorithms, both at fast rates and when escaping 
from saddle points, leads us to very good optimization methods for non-convex 
stochastic problems. 

Next, we turn to the offline version that works with a sum of functions (6):

$$f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x), \qquad (30)$$

where each $f_i(x)$ has a Lipschitz continuous Hessian. In this regime, we have $m$ functions,
and hence, classic CR needs to compute $O(m\varepsilon^{-3/2})$ Hessians. To reduce this number, the
authors of [150, 265] used subsampled gradients and subsampled Hessians,
which achieve $O(m\varepsilon^{-3/2} \wedge \varepsilon^{-7/2})$ gradient complexity and $O(m\varepsilon^{-3/2} \wedge \varepsilon^{-5/2})$
Hessian complexity, similarly to the previous section. Many later articles proposed
different stochastic variance-reduced cubic (SVRC) methods. To collect these
results in one place, we add a table (see Table 5) with the convergence rates, where
$a \wedge b = \min\{a, b\}$.

As a result, we have a method that not only works efficiently with a big sum by
utilizing its stochastic nature but also employs Hessian information to escape saddles
more effectively and arrive at a better local minimum. This statement is supported
by the experiments described in [177, 197, 200, 266]. The authors of these papers
experiment with various second-order methods and show how they compete in practice
with first-order methods that use no second-order information. These papers'
Table 5 An overview of the number of computations of gradients and Hessians of functions in (30)

Method              Gradient                                                        Hessian
CR [188]            $O(m \varepsilon^{-3/2})$                                       $O(m \varepsilon^{-3/2})$
SCR [150, 265]      $\tilde{O}(m \varepsilon^{-3/2} \wedge \varepsilon^{-7/2})$     $\tilde{O}(m \varepsilon^{-3/2} \wedge \varepsilon^{-5/2})$
SVRC1 [287]         $\tilde{O}(m^{4/5} \varepsilon^{-3/2})$                         $\tilde{O}(m^{4/5} \varepsilon^{-3/2})$
SVRC2 [255, 288]    $\tilde{O}(m \varepsilon^{-3/2})$                               $\tilde{O}(m^{2/3} \varepsilon^{-3/2})$

(m2/3 3/2) 


STR [223] 63/2 A mi/? . @~?) 
SRVRC [289] 


0 6 
0 6 
0 6 
SVRC3 [274] O (m+ ¢~3/?2 p m2/3 . 6-5/2) 0 
0 6 
0 6 
Lower bound [89] Q Q 


main conclusions are that second-order methods find deeper local minima and avoid
saddle points. They are also more robust to the choice of hyper-parameters. Subsampling
speeds up computations and allows for the parallelization of such methods. As
a result, second-order methods may be competitive with first-order methods in
practice.


6.3 Tensor Methods 


Next, we present high-order or tensor methods for finding local minima of a highly 
smooth and non-convex objective function. High-order derivatives better describe 
functions and enable you to use curvature to improve convergence. 

First, we lay out some standard assumptions about the smoothness of the function
$f$. In the following, we will denote the directional derivative of the function $f$ at $x$
along the directions $h^j \in \mathbb{R}^n$, $j = 1, \dots, p$, as

$$\nabla^p f(x)[h^1, \dots, h^p].$$

For instance, $\nabla f(x)[h] = \nabla f(x)^T h$ and $\nabla^2 f(x)[h]^2 = h^T \nabla^2 f(x) h$.

The functions $f_i$ have $L_p$-Lipschitz-continuous derivatives for each $p = 0, \dots, 3$,

$$\|\nabla^p f_i(x) - \nabla^p f_i(y)\|_2 \le L_p \|x - y\|_2$$

for all $x, y \in \mathbb{R}^n$.
From this inequality, we get the following tensor method for $p = 3$,

$$x^{k+1}_{\mathrm{Tensor}} = x^k + \operatorname*{argmin}_{s \in \mathbb{R}^n} \left[ \nabla f(x^k)[s] + \frac{1}{2} \nabla^2 f(x^k)[s]^2 + \frac{1}{6} \nabla^3 f(x^k)[s]^3 + \frac{L_3}{24} \|s\|_2^4 \right].$$


In [29, 50, 51], it was proved that the $p$-th-order tensor method with Taylor
approximation is optimal, i.e., it matches the lower bounds, and converges with the rate
$O(\varepsilon^{-(p+1)/p})$ for non-convex problems; hence for third-order methods, we get
the rate $O(\varepsilon^{-4/3})$ instead of $O(\varepsilon^{-3/2})$ for second-order methods. So,
third-order methods are faster than second-order methods in terms of iterations.

Another crucial motivation is that a second-order method could get stuck at
a so-called degenerate saddle point, where the Hessian matrix has non-negative
eigenvalues with some eigenvalues equal to 0 [14].

In [290], it is shown how gradient descent and the cubic regularization method
get stuck at such points even for small problems, such as $f(x, y) = x^3 - 3xy^2$ with the
degenerate saddle point $(0, 0)$. So, we should use third-order information to escape
them.

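This leads us to the notion of a third-order critical point. We define the following criticality measures:

$$\chi_{f,1}(x^k) = \|\nabla f(x^k)\|_2,$$

$$\chi_{f,2}(x^k) = \max\left\{0, -\lambda_{\min}(\nabla^2 f(x^k))\right\},$$

$$\chi_{f,3}(x^k) = \max_{y \in Z_{k+1}} \left|\nabla^3 f(x^k)[y]^3\right|,$$

where $Z_{k+1}$ is the kernel of $\nabla^2 f(x^k)$. Then, we call $x^k$ an $(\varepsilon_1, \varepsilon_2, \varepsilon_3)$-third-order
critical point if

$$\chi_{f,1}(x^k) \le \varepsilon_1, \quad \chi_{f,2}(x^k) \le \varepsilon_2, \quad \chi_{f,3}(x^k) \le \varepsilon_3.$$

A third-order method converges to an $(\varepsilon_1, \varepsilon_2, \varepsilon_3)$-third-order critical point with the
rate $O\left(\max\left(\varepsilon_1^{-4/3}, \varepsilon_2^{-2}, \varepsilon_3^{-4}\right)\right)$.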
However, computing the third-order derivative is very computationally
expensive. This problem leads us to stochastic tensor methods. The main idea of the
stochastic method is that, by concentration inequalities, we can compute far
fewer Hessians and third-order derivatives than gradients for sum-type problems.
The correct proportions are given in (35). For example, if we have 200,000 functions
in the sum, we may compute the full gradient, only 10,000 Hessians, and 100 third-order
derivatives and get the same speed as for the full Hessian and full third-order derivatives.

The authors of [172] introduce such a method, which works with batched tensors and
converges as fast as full-batch methods. The optimization algorithm we consider
is detailed in Algorithm 11. This algorithm uses subsampled derivatives instead
of exact quantities, and its implementation relies on tensor-vector products only.
The proposed approach is shown to find an $(\varepsilon_1, \varepsilon_2, \varepsilon_3)$-third-order critical point
in at most $O\left(\max\left(\varepsilon_1^{-4/3}, \varepsilon_2^{-2}, \varepsilon_3^{-4}\right)\right)$ iterations, thereby matching the rate of
deterministic approaches.
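We construct an inexact Taylor approximation model and add a fourth-order
regularization defined as

$$\phi_k(s) = f(x^k) + s^T g^k + \frac{1}{2} s^T B^k s + \frac{1}{6} T^k[s]^3,$$

$$\psi_k(s) = \phi_k(s) + \frac{\sigma_k}{4} \|s\|_2^4, \qquad (31)$$

where $g^k$, $B^k$, and $T^k$ approximate the derivatives $\nabla f(x^k)$, $\nabla^2 f(x^k)$, and $\nabla^3 f(x^k)$
through sampling as follows. Three sample sets $S^g$, $S^b$, and $S^t$ are drawn, and the
derivatives are then estimated as

$$g^k = \frac{1}{|S^g|} \sum_{i \in S^g} \nabla f_i(x^k), \quad B^k = \frac{1}{|S^b|} \sum_{i \in S^b} \nabla^2 f_i(x^k), \quad T^k = \frac{1}{|S^t|} \sum_{i \in S^t} \nabla^3 f_i(x^k).$$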

It is worth mentioning that the implementation of the algorithm does not require 
the computation of the Hessian or the third-order tensor, both of which would 
demand significant computational resources, but rather directly computes tensor- 
vector products with a complexity of order O(n). 

We will make use of the following condition in order to reach an ¢-critical point 
(where ¢ = 1). For a given € accuracy, one can choose the size of the sample sets 
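$S^g$, $S^b$, $S^t$ for sufficiently small $\kappa_g, \kappa_b, \kappa_t > 0$ such that:

$$\|g^k - \nabla f(x^k)\|_2 \le \kappa_g \varepsilon, \qquad (32)$$
$$\|(B^k - \nabla^2 f(x^k)) s\|_2 \le \kappa_b \varepsilon^{2/3} \|s\|_2, \quad \forall s \in \mathbb{R}^n, \qquad (33)$$
$$\|T^k[s]^2 - \nabla^3 f(x^k)[s]^2\|_2 \le \kappa_t \varepsilon^{1/3} \|s\|_2^2, \quad \forall s \in \mathbb{R}^n. \qquad (34)$$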


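In practice, we can choose the sizes of the sample sets $S^g$, $S^b$, and $S^t$ as follows:

$$n_g = \tilde{O}\left(\kappa_g^{-2} \varepsilon^{-2}\right), \quad n_b = \tilde{O}\left(\kappa_b^{-2} \varepsilon^{-4/3}\right), \quad n_t = \tilde{O}\left(\kappa_t^{-2} \varepsilon^{-2/3}\right), \qquad (35)$$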


where $\tilde{O}$ hides polylogarithmic factors and a polynomial dependence on $n$. We can
see that due to the stochastic nature of the data and tensor concentration inequalities,
we can use far fewer computations while still achieving the same convergence speed
as a full-batch method.

Algorithm 11 Stochastic Tensor Method (STM)

1: Input:
2:   Starting point $x^0 \in \mathbb{R}^n$ (e.g., $x^0 = 0$)
3:   $0 < \gamma_1 < 1 < \gamma_2 < \gamma_3$, $1 > \eta_2 > \eta_1 > 0$, and $\sigma_0 > 0$, $\sigma_{\min} > 0$
4: for $k = 0, 1, \dots$, until convergence do
5:   Sample gradient $g^k$, Hessian $B^k$, and third-order tensor $T^k$ such that Eqs. (32), (33), and (34) hold.
6:   Obtain $s^k$ by minimizing $\psi_k(s)$ (Eq. (31)).
7:   Compute $f(x^k + s^k)$ and
       $\rho_k = \frac{f(x^k) - f(x^k + s^k)}{f(x^k) - \phi_k(s^k)}$
8:   Set
       $x^{k+1} = x^k + s^k$ if $\rho_k \ge \eta_1$,
       $x^{k+1} = x^k$ otherwise.
9:   Set
       $\sigma_{k+1} \in [\max\{\sigma_{\min}, \gamma_1 \sigma_k\}, \sigma_k]$ if $\rho_k > \eta_2$ (very successful iteration),
       $\sigma_{k+1} \in [\sigma_k, \gamma_2 \sigma_k]$ if $\eta_2 \ge \rho_k \ge \eta_1$ (successful iteration),
       $\sigma_{k+1} \in [\gamma_2 \sigma_k, \gamma_3 \sigma_k]$ otherwise (unsuccessful iteration).
10: end for
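The acceptance test and the regularization update in steps 8 and 9 of Algorithm 11 can be sketched as a small helper. This is only an illustration: the function name `update_regularization` and the default parameter values are assumptions, and, since the algorithm allows any value inside each interval, the sketch simply picks one admissible endpoint per case.

```python
def update_regularization(rho, sigma, eta1=0.1, eta2=0.9,
                          gamma1=0.5, gamma2=2.0, gamma3=4.0,
                          sigma_min=1e-6):
    """Return (accept_step, new_sigma) for the ratio rho of actual to model decrease.

    Each branch picks one admissible value from the interval allowed by the
    algorithm; other choices inside the interval are equally valid.
    """
    assert 0 < gamma1 < 1 < gamma2 < gamma3
    assert 0 < eta1 < eta2 < 1
    accept_step = rho >= eta1
    if rho > eta2:
        # very successful: shrink sigma, lower end of [max(sigma_min, g1*sigma), sigma]
        new_sigma = max(sigma_min, gamma1 * sigma)
    elif rho >= eta1:
        # successful: keep sigma, lower end of [sigma, g2*sigma]
        new_sigma = sigma
    else:
        # unsuccessful: increase sigma, lower end of [g2*sigma, g3*sigma]
        new_sigma = gamma2 * sigma
    return accept_step, new_sigma

very_successful = update_regularization(0.95, 1.0)
successful = update_regularization(0.5, 1.0)
unsuccessful = update_regularization(0.0, 1.0)
```

A rejected step leaves the iterate unchanged and only increases the regularization weight, so the next model minimizer is shorter.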


As shown in [89], the lower bounds for sum-type problems are still rather far from
the upper bounds, even for second-order methods. Hence, further research in this
area may lead to new methods for sum-type problems that use variance reduction
techniques. Another possible direction of research is a combination of tensor methods
with first- or second-order methods.


7 Zeroth-Order Methods 


Gradient-free or zeroth-order optimization methods, which use only function values, 
are becoming increasingly important in machine learning problems, especially in 
reinforcement learning [176], black-box adversarial attacks on deep neural networks 
[198], and other problems with structure making gradients difficult or infeasible to 
obtain. 

While there is a class of methods that does not have any connection to the
gradient, for example, random search algorithms [218] (which are among the
first methods of zeroth-order optimization, besides grid search), the Nelder-Mead
algorithm [182], the model-based methods (see Chapters 2-6 and 10-11 in [66]),
or the recent stochastic three points (STP) method [26] and its momentum variant
STMP [112], most zeroth-order optimization methods use gradient estimations,
such as $g(x) = \sum_{i=1}^{n} [(f(x + \mu e_i) - f(x))/\mu] e_i$ (where $e_i$ are the columns of the $n \times n$
identity matrix $I_n$, $i \in \{1, \dots, n\}$), and then for good enough functions ($f \in C_L^{1,1}$, i.e.,
continuously differentiable with Lipschitz-continuous gradient), it can be shown,
for example, that $\|g(x) - \nabla f(x)\|_2 \le \mu L \sqrt{n}$. One can then consider some first-
order optimization scheme, replace actual gradients with their estimations, and use
bounds like this to return to gradients from estimations in proofs, obtaining the
results for the zeroth-order case relatively easily.
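As a minimal sketch of the coordinate-wise estimate described above, the following helper evaluates the forward differences along the standard basis vectors. The name `fd_gradient` and the quadratic test function are assumptions for illustration; for $f(x) = \frac{1}{2}\|x\|_2^2$ (so $L = 1$) each coordinate of the estimate equals $x_i + \mu/2$, which makes the error easy to check by hand.

```python
def fd_gradient(f, x, mu):
    """Forward-difference estimate g(x) = sum_i [(f(x + mu e_i) - f(x)) / mu] e_i.

    Uses n + 1 oracle calls: f(x) once, plus one perturbed point per coordinate.
    """
    fx = f(x)
    g = []
    for i in range(len(x)):
        x_shift = list(x)
        x_shift[i] += mu          # move along the i-th basis vector
        g.append((f(x_shift) - fx) / mu)
    return g

# Hypothetical smooth test function with L = 1 and gradient equal to x.
def half_squared_norm(x):
    return 0.5 * sum(xi * xi for xi in x)

point = [1.0, -2.0, 3.0]
mu = 1e-3
estimate = fd_gradient(half_squared_norm, point, mu)
error = sum((gi - xi) ** 2 for gi, xi in zip(estimate, point)) ** 0.5
```

Here `error` is $\mu\sqrt{n}/2$ up to rounding, which is within the bound $\mu L \sqrt{n}$ quoted in the text.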

While such deterministic zeroth-order schemes (like the GD with gradient 
estimation of the same form as above) often suffer from the problem dimensionality 
because of the number of oracle calls needed to reconstruct the gradient (n for the 
estimation mentioned above, see also [25] for other examples), in a randomized 
approach, one can use two- or one-point schemes of gradient approximation, which 
makes every iteration simpler, sometimes leading to better results in terms of oracle 
calls [167]. Another benefit of the stochastic approach is that such methods often 
have good theoretical properties, for example, the Gaussian smoothing approach 
[189] that gives a smoothed version of the initial function, for which the convergence 
of stochastic zeroth-order algorithm can be easily proved, which can be later used to 
show the convergence of the algorithm for the initial function. And there are setups 
(e.g., online learning [40]) where one is limited to use only several (or even one) 
oracle queries thus being unable to construct the full gradient approximation, so the 
stochastic approach becomes the only option. 

We begin with the formalization of these zeroth-order randomized schemes—we 
have a problem with the form 


min f(x) 
xeQCR" 


and then stochastic zeroth-order methods generate Or} s.t 
BY ACA X, Pty, fw 0) 


so the procedure $\mathcal{A}$ gives us $x^{k+1}$ based on function values (obtained via the oracle
$\tilde{f}$), the history of $\{x^k\}$, random vectors $\{u^k\}$, and parameters $P$ such as the dimension $n$
of $X$, the Hölder parameters $L$ and $\nu$, etc. The function $\tilde{f}$ is not necessarily equal to
$f$, and we can, for example, use $\tilde{f}(x) = f(x) + \varepsilon(x)$ where $|\varepsilon(x)| \ll |f(x)|$, or
$\tilde{f}(x, u) = f(x) + \varepsilon_u$ s.t. $\mathbb{E}_u[\tilde{f}(x, u)] = f(x)$.

In the subsections, we will discuss the characteristics of several zeroth-order
gradient estimations and then the zeroth-order methods for sum minimization type
problems in a non-convex setup. Other information on gradient-free optimization
(such as structured objectives) can be found in the recent survey [158].

7.1 Random Directions Gradient Estimations


Let us start with the methods following the standard zeroth-order scheme of using
a gradient approximation to benefit from the analysis of first-order methods. In this
section, all methods have a form similar to the classic gradient descent

$$x^{k+1} = x^k - h_k g(x^k, u^k)$$

with the only difference that instead of the true gradient we use the gradient approx-
imation $g(x^k, u^k)$. One way to build such gradient approximations is to use random
directions to compute finite differences in the form

$$g(x^k, u^k) = \frac{\tilde{f}(x^k + \mu u^k) - \tilde{f}(x^k)}{\mu} u^k.$$
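A minimal sketch of this random-direction estimate is given below, assuming Gaussian directions $u \sim \mathcal{N}(0, I_n)$ and a noiseless oracle. The helper names and the linear test function are assumptions for illustration. For a linear function $f(x) = a^T x$ the estimate equals $(a^T u) u$, whose expectation is exactly $a$, so averaging many samples should recover the gradient.

```python
import random

def gaussian_direction_estimate(f, x, mu, rng):
    """Two-point estimate [(f(x + mu u) - f(x)) / mu] u with u ~ N(0, I_n)."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_shift = [xi + mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_shift) - f(x)) / mu
    return [scale * ui for ui in u]

def averaged_estimate(f, x, mu, samples, seed=0):
    """Average of independent random-direction estimates (a mini-batch)."""
    rng = random.Random(seed)
    acc = [0.0] * len(x)
    for _ in range(samples):
        g = gaussian_direction_estimate(f, x, mu, rng)
        acc = [a + gi for a, gi in zip(acc, g)]
    return [a / samples for a in acc]

# Hypothetical linear objective with gradient a = (1, -2, 0.5).
a = [1.0, -2.0, 0.5]
linear = lambda x: sum(ai * xi for ai, xi in zip(a, x))
avg = averaged_estimate(linear, [0.3, -0.1, 0.7], mu=1e-2, samples=20000)
```

A single sample is a very noisy estimate of the gradient; only the average over many directions is close to it, which is why the number of samples needed for a given accuracy matters in the comparison with finite differences later in this section.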


It makes sense to use centrally symmetric distributions for $u^k$, for example, the
uniform distribution over the unit Euclidean sphere $S^{n-1} = \{x \in \mathbb{R}^n : \|x\|_2 = 1\}$
(see [87, 96, 109]), or $u^k \sim \mathcal{N}(0, I_n)$, the so-called Gaussian smoothing
introduced in [189]. In that paper, the authors proved that the Gaussian approximation

$$f_\mu(x) = \frac{1}{\kappa} \int_{\mathbb{R}^n} f(x + \mu u) e^{-\|u\|_2^2/2} \, du$$

(where $\kappa = \int_{\mathbb{R}^n} e^{-\|u\|_2^2/2} \, du = (2\pi)^{n/2}$) has several good properties, such as
convexity preservation (if $f$ is convex, then $f_\mu$ is convex too), differentiability, and,
if $f \in C_{L_0}^{0,0}$ or $f \in C_{L_1}^{1,1}$ (i.e., a Lipschitz-continuous function with constant $L_0$ or
a function with Lipschitz-continuous gradient with constant $L_1$, respectively), then the same
holds for $f_\mu$ with $L_0(f_\mu) \le L_0(f)$ and $L_1(f_\mu) \le L_1(f)$, respectively. It can also
be shown that $|f_\mu(x) - f(x)| \le \mu L_0 \sqrt{n}$ for the case of $f \in C_{L_0}^{0,0}$.
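While in that paper the authors mostly discuss the convex case, there are some
results (in Section 7) for a non-convex objective $f$ too. They consider the process
$x^{k+1} = x^k - h_k g(x^k, u^k)$, with $g$ defined above, $\tilde{f} = f$ and $u^k \sim \mathcal{N}(0, I_n)$,
and show that for the case of $f \in C_{L_1}^{1,1}$, this process converges in the sense of
$\mathbb{E}_U \|\nabla f_\mu(x)\|_2$ (where $U = \{u^k\}$):

$$\frac{1}{N} \sum_{k=0}^{N-1} \mathbb{E}_U\left[\|\nabla f_\mu(x^k)\|_2^2\right] \le 8(n + 4) L_1 \left[ \frac{f_\mu(x^0) - f^*}{N} + \frac{3\mu^2 (n + 4)}{32} L_1 \right]$$

and then, using the fact that [189, Lemma 3] $\|\nabla f_\mu(x) - \nabla f(x)\|_2 \le [\mu L_1/2](n + 3)^{3/2}$,
we obtain (from $\|\nabla f(x)\|_2^2 \le 2\|\nabla f_\mu(x) - \nabla f(x)\|_2^2 + 2\|\nabla f_\mu(x)\|_2^2$)

$$\frac{1}{N} \sum_{k=0}^{N-1} \mathbb{E}_U\left[\|\nabla f(x^k)\|_2^2\right] \le \frac{\mu^2 L_1^2}{2} (n + 3)^3 + 16(n + 4) L_1 \left[ \frac{f_\mu(x^0) - f^*}{N} + \frac{3\mu^2 (n + 4)}{32} L_1 \right]$$

and choosing $\mu = O(\varepsilon/[n^{3/2} L_1])$, we ensure $\frac{1}{N} \sum_{k=0}^{N-1} \mathbb{E}_U[\|\nabla f(x^k)\|_2^2] \le \varepsilon^2$ with
the upper bound for the expected number of steps $N = O(n/\varepsilon^2)$.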
For the case of f € co 


= 1 r = 
tne u [tv ice] < 5 ci P+ on +43 yi i} 


they show only that this process converges to the stationary point of f),(x)— 
consider Q with diam(Q) < R, and then it can be shown that we need to make 


rae, (“ + i) 


64$ 


steps to ensure that Toy UU [IV fu (x*) 3] < «* keeping functional gap 
| fu(x) — f(*)| < 6 small. The authors also mention that with the hy, — O and 
4 — 0 the convergence in the sense of Ey||V f(x)||2 can be proved too. 

These results can be extended [227] to the case of a noisy oracle, i.e., $|\tilde{f}(x) - f(x)| \le \delta$,
for $f$ with Hölder continuous gradient ($\|\nabla f(x) - \nabla f(y)\|_2 \le L_\nu \|x - y\|_2^\nu$): it can
be shown that for a small enough noise $\delta$, these convergence rates can be preserved.
More specifically, to ensure +p UU [ll VFS 3] < «7, one need to make 


a 


3+7v 
n 4v 


év 


24+ oan 
n v v 
N=O ( 5 steps under the assumption that noise 6 = O (S) 


where v is a Hoélder parameter. For the convergence in the sense of smoothed 
function gradient norm ey UU [IV Siu (x*) 13] < 7, it can be shown 


7-3 sav 
2, : € I+v 

N=O (“es =] with 6 = O (<==) 
€ l+v n 4 
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with functional gap | f(x) — f(x)| = O (e/a). For the case of v = 1 (i.e., 


fec ee these results can be improved to N = O (1/c*) (n times better) achieving 
the same rate of convergence as in the previous paper [189]. 

Such noisy setup is also interesting because it can be shown [211] that for a 
non-convex function 7 (x) s.t. | a (x) — f(x)| < ef, where initial f is convex and 
1-Lipschitz and ¢  ~ max {e°/va, e/n } , there exists an algorithm that finds a point x 
s.t. 7 (x) < a + e with complexity Poly (n, 1), The dependence € ¢ (€) is optimal 
in this class of algorithms. 

This Gaussian smoothing technique was later used in [102] (RSGF)
and [104] (RSPGF) to obtain complexity guarantees for stochastic zeroth-order
optimization. In the first one [102], the unconstrained problem $Q = \mathbb{R}^n$ is
considered, where $\tilde{f} = F(x, \xi)$ s.t. $\mathbb{E}_\xi[F(x, \xi)] = f(x)$, $F(\cdot, \xi)$ has a
Lipschitz-continuous gradient with constant $L$, and $\xi$ is a random variable whose
distribution $P$ is supported on $\Xi \subseteq \mathbb{R}^d$. The procedure $\mathcal{A}$ has a form similar to
the one proposed in [189]

$$x^{k+1} = x^k - h_k G(x^k, \xi^k, u^k), \quad G(x^k, \xi^k, u^k) = \frac{F(x^k + \mu u^k, \xi^k) - F(x^k, \xi^k)}{\mu} u^k,$$

and from $\mathbb{E}_\xi[F(x, \xi)] = f(x)$, it follows that

$$\mathbb{E}_{\xi, u}[G(x, \xi, u)] = \nabla f_\mu(x).$$

The method then chooses the output $x^k$ from the generated $\{x^k\}_{k=1}^{N}$ as $k = R$, where
$R$ is some random variable with a probability mass function $P_R$ supported on
$\{1, \dots, N\}$. The main goal of introducing this random iteration count $R$ is to derive
new complexity results for the non-convex stochastic optimization case.


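For the case of $f \in C_L^{1,1}$, smoothing parameter $\mu$, $D_f = [2(f(x^1) - f^*)/L]^{1/2}$,
variance $\sigma^2$ ($\mathbb{E}_\xi[\|\nabla F(x, \xi) - \nabla f(x)\|_2^2] \le \sigma^2$), and the probability mass function

$$P_R(k) = \frac{h_k - 2L(n + 4) h_k^2}{\sum_{i=1}^{N} \left(h_i - 2L(n + 4) h_i^2\right)},$$

they obtain [102, Theorem 3.2]

$$\frac{1}{L} \mathbb{E}\left[\|\nabla f(x^R)\|_2^2\right] \le \frac{D_f^2 + 2\mu^2 (n + 4) \left[1 + L(n + 4)^2 \sum_{k=1}^{N} \left(\frac{h_k}{4} + L h_k^2\right)\right] + 2(n + 4) \sigma^2 \sum_{k=1}^{N} h_k^2}{\sum_{k=1}^{N} \left[h_k - 2L(n + 4) h_k^2\right]}$$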
where the expectation is taken with respect to $R$ and $\{\xi^k\}$. After choosing specific
constant stepsizes $h_k = \frac{1}{\sqrt{n+4}} \min\left\{\frac{1}{4L\sqrt{n+4}}, \frac{\tilde{D}}{\sigma\sqrt{N}}\right\}$ (note that this makes
$P_R$ uniform on $\{1, \dots, N\}$), they get [102, Corollary 3.3]

$$\frac{1}{L} \mathbb{E}\left[\|\nabla f(x^R)\|_2^2\right] \le \frac{12(n + 4) L D_f^2}{N} + \frac{4\sigma\sqrt{n + 4}}{\sqrt{N}} \left(\tilde{D} + \frac{D_f^2}{\tilde{D}}\right),$$

where $\tilde{D} > 0$ is our estimation of $D_f$ (e.g., some upper bound). It can be shown
that to ensure $P\{\|\nabla f(x^R)\|_2^2 \le \varepsilon\} \ge 1 - \Lambda$ (the so-called $(\varepsilon, \Lambda)$-solution), the total
number of calls to the oracle $\tilde{f}$ can be bounded as


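Another method considered in [102] is a two-phase method (2-RSGF),
which uses the first one (RSGF) $S = \log(2/\Lambda)$ times as a subroutine producing a list
of candidates $\{\bar{x}^k\}_{k=1}^{S}$, and then the output point $\bar{x}^*$ is chosen in such a way that

$$\|g(\bar{x}^*)\|_2 = \min_{k = 1, \dots, S} \|g(\bar{x}^k)\|_2, \quad g(\bar{x}^k) = \frac{1}{T} \sum_{i=1}^{T} G(\bar{x}^k, \xi^i, u^i).$$

Then it can be shown [102, Theorem 3.4] that an $(\varepsilon, \Lambda)$-solution will be achieved after
taking

$$O\left( \frac{n L^2 D_f^2 \log(1/\Lambda)}{\varepsilon} + n L^2 \left(\tilde{D} + \frac{D_f^2}{\tilde{D}}\right)^2 \frac{\sigma^2}{\varepsilon^2} \log(1/\Lambda) + \frac{n \log^2(1/\Lambda)}{\Lambda} \left(1 + \frac{\sigma^2}{\varepsilon}\right) \right)$$

calls to $\tilde{f}$, which is better than the previous bound in terms of $\Lambda$.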

A more general problem $\min_{x \in Q \subseteq \mathbb{R}^n} \Psi(x) = f(x) + h(x)$, where $f \in C_L^{1,1}$
and $h(x)$ is a simple convex and possibly non-smooth function, is considered in
[104]. The authors use a mini-batched version of the gradient estimation from the previous
paper [102] and a generalized projection, obtaining [104, Theorem 4, Corollaries 6-7]
similar bounds for the gradient norm.

In [220], the authors use symmetric gradient estimations based on the uniform
distribution over the sphere to build a method that depends less on the dimension. They
consider the minimization problem $\min_{x \in \mathbb{R}^n} f(x) = \mathbb{E}_\xi[F(x, \xi)]$
(note that in this paper, the authors consider both $\mathbb{R}^d$ and $\mathbb{R}^n$ with $d < n$), where
$f(x)$ is $L$-Lipschitz and $\mu$-smooth, $|F(x, \xi)| \le \Omega$, and the variance of $F$ is bounded by
$V_F$. It was shown that using


fot taut &) — Fok muh ey 


g(x®, &* u®) := 
2u 


where uk ~ Y (s-*) (uniform distribution on the unit sphere S"—!) and the 


KAT — yk _ ag xk, &* uk) after N steps 


| zp 
DELI re] = 0 (ts + Ts) 


i=l 


process x 


=| 


Now consider the case when for a given $\xi$, $F(x, \xi) = g(r(x, \theta^*), \psi^*)$ (here
$g(\cdot, \psi)$ and $r(\cdot, \theta)$ are parameterized function classes), where $r(\cdot, \theta^*) : \mathbb{R}^n \to \mathbb{R}^d$
with $d < n$. To put it simply, the authors consider the case when $F(\cdot, \xi) : \mathbb{R}^n \to
\mathbb{R}$ is actually defined on a $d$-dimensional manifold $\mathcal{M}$ for all $\xi$. That means
that if one knows the manifold (i.e., $\theta^*$), and $g$ and $r$ are smooth, the chain rule
can be applied, giving $\nabla f(x) = J(x, \theta^*) \nabla_r g(r, \psi^*)$ (where $J(x, \theta^*) = \partial r(x, \theta^*)/\partial x$),
leading to


Fock + wdgul, 8) — FO = wdgu’ 69) 
2u 


g(x", é*,u")) ed 
where J, is the orthonormalized J (x*, 6*) anduk ~ Y (se-?), and this gives 


1 ad dae 
WEL se 13] = 0 (sm + =) 


i=l 


which is much better than the previous one (because $d \ll n$). However, this is
impractical because it requires the knowledge of $\theta^*$. The authors mix the two
previous estimations and estimate $\theta$ and $\psi$ on every step, obtaining a method that
[220, Theorem 1] after $N$ steps ensures


, 1/2 Radid 1/2 P+ 12 Q7/3 
vee I3] = o(* +" “+o ) 


N NivP Nis 


=| 


i=1 


which is better than the initial bound for d < ni, 

While such gradient estimates based on random directions are common, it can be
shown that, in terms of the number of samples required for the approximate gradient
to satisfy the norm condition (or at least to satisfy it with some probability), random
directions-based methods lose to standard finite differences [23-25]. In these papers,
the authors consider an unconstrained optimization problem $\min_{x \in \mathbb{R}^n} f(x)$, where
$\tilde{f}(x) = f(x) + \varepsilon(x)$ is computable, and the noise $\varepsilon$ is bounded uniformly:
$|\varepsilon(x)| \le \varepsilon_f$, and $f(x) \in C_L^{1,1}$ or $f(x) \in C_M^{2,2}$ (i.e., a twice continuously differentiable
function with $M$-Lipschitz continuous Hessian).

The main idea in [25] is to compare the number of calls $r$ (essentially a batch
size) to the oracle $\tilde{f}(x)$ that is enough to ensure the norm condition

$$\|g(x) - \nabla f(x)\|_2 \le \theta \|\nabla f(x)\|_2, \quad \theta \in [0, 1)$$

for a zeroth-order gradient estimation $g(x)$. This condition simplifies the transition
from gradient estimations to gradients when proving the convergence of algorithms.
One of its implications is that $-g(x)$ is a descent direction for the function $f$. In [24],
the line-search method that uses such gradient approximations, ensuring the norm
condition, is shown to converge.

They consider several methods of gradient estimation, both deterministic (Forward
and Central Finite Differences (FFD and CFD) and Linear Interpolation (LI)
as a generalization) and stochastic (Gaussian Smoothed Gradients (GSG and its
centered version cGSG) and Sphere Smoothed Gradients (BSG and cBSG)); for
the latter, the authors obtain the number of calls needed to ensure the norm condition
with probability $1 - \delta$.

Let us take a look at two of these methods: FFD and GSG. For the first one, the
gradient estimation takes the form

$$g(x) = \sum_{i=1}^{n} \frac{\tilde{f}(x + \mu e_i) - \tilde{f}(x)}{\mu} e_i,$$

where $e_i$ are the columns of $I_n$. It can be shown that for such $g(x)$, the following
holds

$$\|g(x) - \nabla f(x)\|_2 \le \frac{\sqrt{n} L \mu}{2} + \frac{2\sqrt{n} \varepsilon_f}{\mu}.$$

If there was no noise ($\varepsilon_f = 0$), we could make this approximation as close to the
gradient as we want, so we would be able to ensure the norm condition in $n$ calls to
$\tilde{f}$. This is also true for a small enough noise (e.g., even from this inequality, we
can take $\varepsilon_f = L\mu^2/4$, obtaining $\|g(x) - \nabla f(x)\|_2 \le \mu L \sqrt{n}$). The authors provide
such a noise bound in the form of a lower bound on $\|\nabla f(x)\|_2$ for which the norm
condition can still be ensured

$$2\sqrt{\frac{\varepsilon_f}{L}} \le \mu \le \frac{\theta \|\nabla f(x)\|_2}{\sqrt{n} L} \implies \frac{2\sqrt{n L \varepsilon_f}}{\theta} \le \|\nabla f(x)\|_2.$$

In other words, that means that we can converge to the neighborhood where

$$\|\nabla f(x)\|_2 \approx \frac{2\sqrt{n L \varepsilon_f}}{\theta}.$$
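As a short worked step (added for illustration, not taken from [25]), the threshold $2\sqrt{nL\varepsilon_f}/\theta$ can be recovered by minimizing the right-hand side of the FFD error bound, $\frac{\sqrt{n} L \mu}{2} + \frac{2\sqrt{n}\varepsilon_f}{\mu}$, over the smoothing parameter $\mu$ for a fixed noise level $\varepsilon_f$. The symbols $\phi$ and $\mu^*$ are introduced here only for this derivation.

```latex
\begin{align*}
\phi(\mu) &= \frac{\sqrt{n}\, L \mu}{2} + \frac{2\sqrt{n}\,\varepsilon_f}{\mu}, \\
\phi'(\mu) &= \frac{\sqrt{n}\, L}{2} - \frac{2\sqrt{n}\,\varepsilon_f}{\mu^2} = 0
  \;\Longrightarrow\; \mu^* = 2\sqrt{\varepsilon_f / L}, \\
\phi(\mu^*) &= \sqrt{n L \varepsilon_f} + \sqrt{n L \varepsilon_f} = 2\sqrt{n L \varepsilon_f}.
\end{align*}
```

Requiring $\phi(\mu^*) \le \theta \|\nabla f(x)\|_2$ then gives $\|\nabla f(x)\|_2 \ge 2\sqrt{n L \varepsilon_f}/\theta$, and $\mu^* = 2\sqrt{\varepsilon_f/L}$ is the same relation as $\varepsilon_f = L\mu^2/4$.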
For the GSG, they consider the mini-batched version of Gaussian smoothing
from [189]

$$g(x, \{u^i\}) := \frac{1}{r} \sum_{i=1}^{r} \frac{\tilde{f}(x + \mu u^i) - \tilde{f}(x)}{\mu} u^i, \quad u^i \sim \mathcal{N}(0, I_n),$$

and prove that the norm condition will be ensured with probability $1 - \delta$ after


- 3n n Es (n+4) 1 _ 3n 
"* $2 (jn—te 166 8 


calls, which, while linear in $n$, is still worse than the plain $n$ in FFD because
of the dependence on $\delta$ and $\theta$ and additional constants. However, this is a sufficient
number of calls, not a necessary one, so the authors derive the lower bound for $r$ [25, Section 2.3.1]


A 43) 


W 


necessary to have probability P(||g(x) — Vf (x)|l2 < @|| Vf) |l2) > 1 — 6. In their 
numerical experiments, they show that to ensure the norm condition with 6 < !/2 
with probability of at least !/2, more than n oracle calls are needed, so this lower 
bound is weak. 

The sufficient lower bound can be improved using smoothing on a sphere for 
which they obtain £2 (”/e? - log [@+/s]), yet it is still worse than deterministic 
variants, and in practice its behavior is very similar to the Gaussian directions-based 
approach. 

There are also results for the case of f(x) € cc (centered versions of the 
estimations), and they can be found in Table 6. 


7.2 Variance-Reduced Zeroth-Order Methods 


One special case of the min f(x) problem is the finite-sum minimization that was 
considered in previous sections for the first-order methods. These problems in 
zeroth-order setup arise in reinforcement learning [94] (there as a minimization of 
a long-term cost that is essentially a sum of functions) and non-stationary online 
optimization problems [280]. 

Let us start with the ZO-SVRG from [167]—a zeroth-order version of SVRG 
from [140]. 

The authors consider a non-convex finite-sum problem of the form

$$\min_{x \in \mathbb{R}^n} f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x),$$

where $f_i \in C_L^{1,1}$, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\|_2 \le L \|x - y\|_2$ for any $x, y \in \mathbb{R}^n$ and
$i \in \{1, \dots, m\}$. They use the standard assumption that the
variance of stochastic gradients is bounded

$$\frac{1}{m} \sum_{i=1}^{m} \|\nabla f_i(x) - \nabla f(x)\|_2^2 \le \sigma^2$$


and consider several different gradient estimates: two based on random directions
on the unit sphere (in the notation of [24], these are BSG with $N = 1$ and $N = q$
(see Table 6), called RandGradEst and Avg-RandGradEst, respectively), and one
deterministic coordinate-wise estimation (a variant of CFD from Table 6 with a possibly
different $\mu_j$ for each direction $e_j$, called CoordGradEst)

$$\text{RandGradEst}: \quad \hat{\nabla} f_i(x) = \frac{n}{\mu} \left[f_i(x + \mu u^i) - f_i(x)\right] u^i,$$

$$\text{Avg-RandGradEst}: \quad \hat{\nabla} f_i(x) = \frac{n}{\mu q} \sum_{j=1}^{q} \left[f_i(x + \mu u^{i,j}) - f_i(x)\right] u^{i,j},$$

$$\text{CoordGradEst}: \quad \hat{\nabla} f_i(x) = \sum_{j=1}^{n} \frac{1}{2\mu_j} \left[f_i(x + \mu_j e_j) - f_i(x - \mu_j e_j)\right] e_j,$$

where $i \in \{1, \dots, m\}$, $\mu > 0$, and $\{e_j\}_{j=1}^{n}$ are the standard basis vectors (columns of
$I_n$).


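Algorithm 12 ZO-SVRG [167]

Require: stepsizes $\{h^k\}$, epoch length $T$, starting point $x^0 \in \mathbb{R}^n$, batch size $r \ge 1$, smoothing
parameter $\mu > 0$, number of iterations $N = S \cdot T$
$\tilde{x}_0 = x_0^0 = x^0$
for $s = 0, 1, 2, \dots, S - 1$ do
    for $k = 0, 1, 2, \dots, T - 1$ do
        Uniformly randomly pick set $I_k$ from $\{1, \dots, m\}$ such that $|I_k| = r$
        $g_s^k = \frac{1}{r} \sum_{i \in I_k} \left(\hat{\nabla} f_i(x_s^k) - \hat{\nabla} f_i(\tilde{x}_s)\right) + \hat{\nabla} f(\tilde{x}_s)$
        $x_s^{k+1} = x_s^k - h^k g_s^k$
    end for
    $\tilde{x}_{s+1} = x_{s+1}^0 = x_s^T$
end for
Pick $\xi$ uniformly at random from $\{0, \dots, N - 1\}$
return $x^\xi$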
For a mini-batch $I \subseteq \{1, \dots, m\}$ of size $r$, the authors denote

$$\hat{\nabla} f_I(x) = \frac{1}{r} \sum_{i \in I} \hat{\nabla} f_i(x),$$

and the algorithm is the same as SVRG (Algorithm 6), with the only difference
that instead of the update with true gradients

$$x_s^{k+1} = x_s^k - h^k v_s^k, \quad v_s^k = \nabla f_{I_k}(x_s^k) - \nabla f_{I_k}(\tilde{x}_s) + \nabla f(\tilde{x}_s),$$

they use gradient estimations

$$x_s^{k+1} = x_s^k - h^k g_s^k, \quad g_s^k = \hat{\nabla} f_{I_k}(x_s^k) - \hat{\nabla} f_{I_k}(\tilde{x}_s) + \hat{\nabla} f(\tilde{x}_s).$$

This estimate is no longer an unbiased estimate of the gradient when zeroth-order gradient
estimations are used, and that is the main difficulty for the convergence analysis of this method. They show
that, under the assumptions mentioned above, the ZO-SVRG algorithm after $N = S \cdot T$
steps (where $S$ is the number of epochs) ensures that


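$$\text{RandGradEst}: \quad \mathbb{E}\left[\|\nabla f(\bar{x})\|_2^2\right] = O\left(\frac{n}{N} + \frac{\delta_n}{r}\right),$$

$$\text{Avg-RandGradEst}: \quad \mathbb{E}\left[\|\nabla f(\bar{x})\|_2^2\right] = O\left(\frac{n}{N} + \frac{\delta_n}{r \min\{n, q\}}\right),$$

$$\text{CoordGradEst}: \quad \mathbb{E}\left[\|\nabla f(\bar{x})\|_2^2\right] = O\left(\frac{n}{N}\right),$$

where $n$ is the dimension, $r = |I_k|$ is the batch size, $q$ is the number of directions used to
estimate the gradient via Avg-RandGradEst, $\bar{x}$ is chosen uniformly from the generated iterates,
$N = S \cdot T$ is the total number of steps, and

$$\delta_n = \begin{cases} 1, & \text{if } I_k \text{ draws samples from } \{1, \dots, m\} \text{ with replacement}, \\ \mathbb{I}(r < m), & \text{if } I_k \text{ draws samples without replacement}, \end{cases}$$

where $\mathbb{I}(r < m) = 1$ if $r < m$ and $\mathbb{I}(r < m) = 0$ otherwise.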

Basically, that means that CoordGradEst, the deterministic policy of gradient
estimation, achieves the convergence rate of the original SVRG. In their tests,
however, in terms of training loss versus function queries, ZO-SVRG (the variant
without mini-batching and with random directions on the sphere) beats ZO-SVRG-Ave
(based on Avg-RandGradEst) and ZO-SVRG-Coord (based on CoordGradEst).
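The structure of ZO-SVRG with the coordinate-wise estimator can be illustrated by the following minimal sketch. It is a simplification and not the implementation of [167]: it uses a constant stepsize, samples the mini-batch without replacement, and returns the last snapshot instead of a uniformly random iterate; the names `coord_grad_est` and `zo_svrg_coord` and the toy finite sum of quadratics are assumptions for illustration.

```python
import random

def coord_grad_est(f, x, mu):
    """Central coordinate-wise estimate: sum_j [f(x + mu e_j) - f(x - mu e_j)] / (2 mu) e_j."""
    g = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += mu
        x_minus[j] -= mu
        g.append((f(x_plus) - f(x_minus)) / (2.0 * mu))
    return g

def zo_svrg_coord(fs, x0, step, epochs, epoch_len, batch, mu, seed=0):
    """Simplified ZO-SVRG loop with coordinate-wise gradient estimates."""
    rng = random.Random(seed)
    m, n = len(fs), len(x0)
    x_tilde = list(x0)
    for _ in range(epochs):
        # Zeroth-order estimate of every component gradient at the snapshot.
        snap = [coord_grad_est(fi, x_tilde, mu) for fi in fs]
        full = [sum(s[j] for s in snap) / m for j in range(n)]
        x = list(x_tilde)
        for _ in range(epoch_len):
            idx = rng.sample(range(m), batch)   # mini-batch without replacement
            g = list(full)
            for i in idx:
                gi = coord_grad_est(fs[i], x, mu)
                for j in range(n):
                    g[j] += (gi[j] - snap[i][j]) / batch   # variance-reduced correction
            x = [xj - step * gj for xj, gj in zip(x, g)]
        x_tilde = x
    return x_tilde

# Hypothetical toy problem: f_i(x) = 0.5 * ||x - c_i||^2, minimized at the mean of c_i.
centers = [(1.0, 0.0), (0.0, 1.0), (-1.0, 2.0), (2.0, 1.0)]
fs = [lambda x, c=c: 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
x_final = zo_svrg_coord(fs, [5.0, -3.0], step=0.2, epochs=5, epoch_len=10, batch=2, mu=1e-3)
```

On this toy problem the central differences are exact up to rounding, so the variance-reduced direction coincides with the true gradient and the iterates contract towards the mean of the centers, here $(0.5, 1.0)$. Each inner step costs $2n$ function queries per sampled component, which is the price of the coordinate-wise estimator.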

Another algorithm discussed above that can be used in the zeroth-order finite-
sum minimization setting is SPIDER [91]. The zeroth-order variant (Algorithm 13)
of the algorithm blends stochastic and deterministic gradient estimations, using
mini-batched FFD (Table 6) every $T$ steps to reconstruct $v^k$, which is later updated
by mini-batched GSG.

The stepsize $h_k = \min\left(\varepsilon/[L n_0 \|v^k\|_2], 1/[2 L n_0]\right)$ is a policy from normalized
gradient descent (NGD, [185]), where the stepsize is inversely proportional to the
norm of the gradient.
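The stepsize rule can be written as a small helper. This is only a sketch of the formula $h_k = \min(\varepsilon/[L n_0 \|v^k\|_2], 1/[2 L n_0])$; the function name `ngd_stepsize` and the handling of a zero direction are assumptions for illustration.

```python
import math

def ngd_stepsize(v, eps, L, n0):
    """Stepsize min(eps / (L * n0 * ||v||), 1 / (2 * L * n0))."""
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    cap = 1.0 / (2.0 * L * n0)
    if norm_v == 0.0:
        return cap                      # assumed convention for a zero direction
    return min(eps / (L * n0 * norm_v), cap)

h_large = ngd_stepsize([3.0, 4.0], eps=0.1, L=2.0, n0=1.0)    # ||v|| = 5
h_small = ngd_stepsize([0.03, 0.04], eps=0.1, L=2.0, n0=1.0)  # ||v|| = 0.05
```

When the direction is large, the displacement $h_k \|v^k\|_2$ equals $\varepsilon/(L n_0)$, so every step has a bounded length; when the direction is small, the constant cap $1/(2 L n_0)$ takes over.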
Algorithm 13 SpiderSZO [91]


Require: no € [1,”'’/6], Lipschitz constant L, epoch length 7, starting point x° € R”, outer 
batch size rj > 1, inner batch size r2 > 1, number of iterations N = S- T 
for k=0,1,2,...,N—1 do 
ifk mod T = 0 then 
Uniformly randomly pick set J, from {1,..., m} (with replacement) such that |J,| = 71 
Cups: Ly lt ok maidf cal 
j=l tel, 
else 
Create set of pairs J, = {(i, u!)} where i uniformly randomly picked from {1,..., m} 


(with replacement) and independent u! ~ .1 (0, I,) such that |J;| = r2 


Compute g§ = > (fete) yi fi ta) hy!) p gkl 
(iu JET 
end if 
xk! & xk — hy ok where hy = min (acta oi) 
end for 
Pick € uniformly at random from {0, ..., N-}\} 
return xé 


The authors show that after $N = O(1/\varepsilon)$ iterations and $O\left(n \min\left(m^{1/2}/\varepsilon^2, 1/\varepsilon^3\right)\right)$
IZO calls (where $n$ is the dimension, $m$ is the number of functions, and an IZO call is a call of the
oracle that returns the value of $f_i(x)$ given $x$ and $i$), this algorithm ensures

$$\mathbb{E}\left[\|\nabla f(\bar{x})\|_2\right] \le 6\varepsilon,$$

where $\bar{x}$ is chosen uniformly from the generated iterates. This result is better than what follows
directly from [189], at least by a factor of $m^{1/2}$ (the direct application of the results
from [189] requires $m$ calls on every step and gives $\mathbb{E}[\|\nabla f(\bar{x})\|_2] \le \varepsilon$ in $O(n/\varepsilon^2)$
steps, so the number of IZO calls would be $O(mn/\varepsilon^2)$).

The results of the two previously discussed papers [91, 167] were improved in the
recent work [136]. The authors show that ZO-SVRG-Coord actually has a better
convergence rate [136, Theorem 2] of $\mathbb{E}[\|\nabla f(\bar{x})\|_2^2] = O(1/N)$ ($n$ times better
than in the previous analysis). At first, they consider an intermediate variant between ZO-
SVRG-Coord and ZO-SVRG-Ave called ZO-SVRG-Coord-Rand that uses CFD
and BSG (Table 6) for the $\hat{\nabla} f(\tilde{x}_s)$ and $\hat{\nabla} f_i(x_s^k) - \hat{\nabla} f_i(\tilde{x}_s)$ parts of

$$g_s^k = \frac{1}{r} \sum_{i \in I_k} \left(\hat{\nabla} f_i(x_s^k) - \hat{\nabla} f_i(\tilde{x}_s)\right) + \hat{\nabla} f(\tilde{x}_s)$$

(from Algorithm 12), respectively, while the variants in [167] used only one type
of gradient estimation at once. Then the authors prove [136, Corollary 1] the
convergence rate $\mathbb{E}[\|\nabla f(\bar{x})\|_2^2] = O(1/N)$ and show [136, Lemmas 1-2] that
although the replacement of BSG with CFD requires $n$ times more oracle calls, it
achieves a more accurate gradient estimation, so the convergence rate stays the same
for ZO-SVRG-Coord.
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Another part of this work is devoted to SPIDER. The authors construct a new 
algorithm (called ZO-SPIDER-Coord) in a way similar to the previous one—they 
use CFD instead of GSG in Algorithm 13 and show that it has the same rate of 
convergence, but with bigger stepsize hy = !/[4L] (that does not depend on ¢), which 
is better in practice. 

One particular case of finite-sum minimization is considered in [280]. In that work, the authors consider non-stationary online optimization problems, where the objective function being queried is time-varying, so one is limited to the use of one-point estimators.

Such estimators can be constructed easily in the stochastic zeroth-order case. For example, we can consider GSG (Table 6) with N = 1 and then

$$\nabla f_\mu(x) = \mathbb{E}_u\left[\frac{f(x + \mu u) - f(x)}{\mu}\, u\right] = \mathbb{E}_u\left[\frac{f(x + \mu u)}{\mu}\, u\right],$$

so we can choose $g(x) := [f(x + \mu u)/\mu]\, u$ and obtain a reasonable one-point estimation. The problem is that the variance of such estimations explodes as $\mu \to 0$ (see [24]). In [280], the authors consider the residual feedback estimator

$$\tilde{g}_k(x^k) = \frac{u^k}{\mu}\left(f_k(x^k + \mu u^k) - f_{k-1}(x^{k-1} + \mu u^{k-1})\right),$$

where $u^k, u^{k-1} \sim \mathcal{N}(0, I_n)$. They show that (Lemma 2.4)

$$\mathbb{E}[\tilde{g}_k(x^k)] = \nabla f_{\mu,k}(x^k), \quad \forall x^k \in X \text{ and } k$$

(here $\nabla f_{\mu,k}$ is the gradient of the smoothed $f_k$). They consider the online bandit problem with regret function


T-— 


RP = DOE[IIV fue 13] 
k=0 


ee 


and show [280, Theorem 4.2] that for xkt! — [Tx Cay — n&k @*)) (where ITx is the 
projection operator onto set X) if f eC ae 


3272 - ; 
RT =0 0 (Wr Ey WrT-") TY? +n? Lge fT? 
; 


3/2 


f 


and if additionally f € C;’' [280, Theorem 4.3] 


RE = CELI fee 13] = 0 (nP Lower? + nL L5' Wr) 
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where Wr and Wr are constants s.t. 


T 
SS ELfe(x) — fe-1@)] < Wr, YT, x 


T 
Ell — f-r@o?] < Wr, VT, x. 


That bound implies that 87/r > 0 if Wr = 0(T'?) and Wr = o(T). The 
authors also consider [280, Section 5] the stochastic online optimization case where 
f = F;(x, &) s.t. ELF; (x, &)] = f;(x) and show that under the assumptions of the 
same form as above (with W7,< and Wr), similar regret bounds can be achieved. 

In their numerical experiments, the authors compare conventional one-point and 
two-point approaches with one-point residual feedback. Even though the latter 
works worse than the two-point variant, it has lower variance and achieves better 
results than the conventional one-point feedback and can be used in practice, in 
contrast to two-point feedback. 
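
The residual feedback update described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the default values of `mu` and `eta`, the extra query used to initialise the memory, and the omission of the projection onto X (the unconstrained case) are choices made here, not details taken from [280].

```python
import random

def residual_feedback_step(f_k, x, prev_value, mu, eta, rng):
    """One round of one-point residual feedback (unconstrained sketch).

    f_k        -- objective of the current round, queried exactly once
    prev_value -- f_{k-1}(x^{k-1} + mu * u^{k-1}), remembered from the last round
    """
    u = [rng.gauss(0.0, 1.0) for _ in x]                 # u^k ~ N(0, I_n)
    value = f_k([xi + mu * ui for xi, ui in zip(x, u)])  # single query
    scale = (value - prev_value) / mu                    # residual / mu
    x_next = [xi - eta * scale * ui for xi, ui in zip(x, u)]
    return x_next, value

def run_residual_feedback(fs, x0, mu=0.1, eta=0.01, seed=0):
    """Run the estimator over a sequence of objectives fs (hypothetical driver)."""
    rng = random.Random(seed)
    x = list(x0)
    prev_value = fs[0](x)  # assumption: one extra query to initialise the memory
    for f_k in fs:
        x, prev_value = residual_feedback_step(f_k, x, prev_value, mu, eta, rng)
    return x
```

Each round costs one function evaluation; the previous perturbed value plays the role that the second query plays in two-point feedback.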


8 Globalization Techniques 


In the previous sections, we mainly considered guarantees for the methods to converge to a stationary point or local extremum. Global performance guarantees are available only for some subclasses of non-convex minimization problems. Nevertheless, there are several practical techniques for globalizing the convergence of local methods, which we briefly describe next, following [283].


8.1 Multistart Technique 


The first approach involves using an algorithm that converges to a local minimum and running it multiple times from different starting points. This yields a procedure that finds multiple local minima of the objective, some of which might in fact be global solutions.

To be more concrete, we consider the problem 


$$\min_{x \in [0,1]^n} f(x).$$

Let the initial points be sampled from the uniform distribution on $[0,1]^n$. If the Lebesgue measure of the attraction basin of the global minimum (the set of points, initialized at which the local algorithm converges to the global minimum) is $\mu > 0$, then the expected number of points required to find the global minimum is $O(1/\mu)$. If the attraction basin is a ball of radius r, then $\mu \sim r^n$. Hence, it is reasonable to expect that the number of initial points required depends on n exponentially. For that reason, this approach to global optimization becomes impractical as n grows.
The effectiveness of this approach also depends on the chosen initial points. The quality of a family of initial points $\{x^{0,i}\}_{i=1}^m$ can be characterized by the quantity

$$d_m\left(\{x^{0,i}\}_{i=1}^m\right) = \max_{x \in [0,1]^n} \min_{i=1,\ldots,m} \|x - x^{0,i}\|.$$

One of the ways to iteratively generate the starting points $\{x^{0,i}\}_{i=1}^m$ is the quasi-Monte Carlo scheme, which uses low-discrepancy sequences, for example, the Van der Corput sequence. Let $\{p_i\}_{i=1}^n$ be a sequence of distinct prime numbers, and let $\varphi_i(k)$ be the k-th element of the Van der Corput sequence in base $p_i$. Explicitly,

$$\varphi_i(k) = \sum_{j=0}^{l_{k,i}} a_j\, p_i^{-j-1},$$

where $l_{k,i}$ is the length of the representation of k in base $p_i$: $k = \sum_{j=0}^{l_{k,i}} a_j\, p_i^{j}$. Finally, set $x^{0,k} = (\varphi_1(k), \ldots, \varphi_n(k))$, $k = 1, \ldots, m$. In this case, $d_m(\{x^{0,i}\}_{i=1}^m) = O(\sqrt{n}\, m^{-1/n} \ln m)$, while the optimal value, which is achieved at the uniform grid, is $O(\sqrt{n}\, m^{-1/n})$.
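
The construction of quasi-Monte Carlo starting points can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and exact rational arithmetic is used only to make the digit reversal easy to verify.

```python
from fractions import Fraction

def van_der_corput(k, base):
    """k-th Van der Corput element: reverse the base-`base` digits of k
    behind the radix point, i.e. sum of a_j * base**(-j-1)."""
    value, denom = Fraction(0), 1
    while k > 0:
        k, digit = divmod(k, base)   # peel off the lowest digit a_j
        denom *= base
        value += Fraction(digit, denom)
    return value

def starting_points(m, primes):
    """m starting points in [0, 1]^n, one distinct prime base per coordinate."""
    return [[float(van_der_corput(k, p)) for p in primes]
            for k in range(1, m + 1)]
```

For example, in base 2 the first elements are 1/2, 1/4, 3/4, 1/8, so consecutive points keep filling the largest remaining gaps.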


8.2 Multidimensional Bisection 


The main shortcoming of the approach described above is that the family $\{x^{0,i}\}_{i=1}^m$ is constructed without taking into account any properties of f(x). Assume now that, for all $x, y \in [0,1]^n$, $|f(x) - f(y)| \le M\|y - x\|$. Then, for any y, the function $f(y) - M\|x - y\|$ is a minorant of f(x). Consequently, for any $\{y^k\}_{k=1}^m$, the function $\max_{k=1,\ldots,m}\{f(y^k) - M\|x - y^k\|\}$ is also a minorant of f(x). Then one may choose the next initial point to be the minimizer of the minorant constructed using the previous initial points [90]:

$$x^{0,m+1} = \arg\min_{x}\ \max_{k=1,\ldots,m}\left\{f(x^{0,k}) - M\|x - x^{0,k}\|\right\}.$$


In the one-dimensional case, each minorant is just a piecewise linear function, and 
its minimum is easy to compute explicitly. In higher dimensions, this idea is more 
difficult to implement, and the resulting algorithms also tend to become slower as 
n increases. This method also requires an estimate of the Lipschitz constant and is 
sensitive to the accuracy of this estimate. 
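
In one dimension, the minorant-based choice of the next point can be sketched as below. This is an assumption-laden illustration rather than the algorithm of [90]: the minimizer of the piecewise linear minorant is approximated on a uniform grid instead of being computed explicitly, and the function names, the grid size, and the initial points 0 and 1 are choices made here.

```python
def next_start_point(f_values, points, M, grid_size=1001):
    """Approximate argmin over [0, 1] of max_k (f(x_k) - M * |x - x_k|)."""
    best_x, best_val = None, float("inf")
    for j in range(grid_size):
        x = j / (grid_size - 1)
        # Value of the piecewise linear minorant at x.
        minorant = max(fv - M * abs(x - p) for fv, p in zip(f_values, points))
        if minorant < best_val:
            best_x, best_val = x, minorant
    return best_x, best_val

def lipschitz_search(f, M, iterations=30):
    """Repeatedly sample f at the (approximate) minimizer of the minorant."""
    points = [0.0, 1.0]
    values = [f(p) for p in points]
    for _ in range(iterations):
        x, _ = next_start_point(values, points, M)
        points.append(x)
        values.append(f(x))
    best = min(range(len(points)), key=values.__getitem__)
    return points[best], values[best]
```

If M underestimates the true Lipschitz constant, the function built here is no longer a minorant and the search may discard the region containing the global minimum, which is the sensitivity mentioned above.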
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8.3 Langevin Dynamics 


The last approach that we consider in this section is inspired by the Langevin dynamics, which is defined by the stochastic differential equation

$$dx(t) = -\nabla f(x(t))\,dt + \sqrt{2T}\,dW(t),$$

where W(t) is a Wiener process (also known as Brownian motion) and T is the temperature parameter. It has been shown that the distribution of x(t) converges to a distribution with density

$$\frac{\exp(-f(x)/T)}{\int \exp(-f(y)/T)\,dy}$$

as $t \to \infty$, and as $T \to 0+$, this distribution concentrates around the global minima. To apply this in practice, the continuous dynamics has to be discretized. One of the ways to do that is as follows:

$$x_{k+1} = x_k - h\nabla f(x_k) + \sqrt{2hT}\,\varepsilon_k,$$

where h > 0 is the stepsize and $\varepsilon_k$ is a standard Gaussian random variable. Non-asymptotic results demonstrating the convergence of this method to an approximate global minimum were presented in the work [263]. In that work, the temperature parameter T was assumed to be constant. However, other strategies are sometimes used in practice, for example,

$$T_k = \frac{c}{\ln(2 + k)},$$

which ensures $T_k \to 0+$ as $k \to \infty$.
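
The discretized Langevin iteration with a logarithmic cooling schedule can be sketched as follows. This is a minimal sketch under assumptions: the function name, the default stepsize, the constant `c`, and the number of steps are illustrative choices and are not taken from [263].

```python
import math
import random

def langevin_minimize(grad, x0, h=0.01, c=1.0, steps=1000, seed=0):
    """Iterate x_{k+1} = x_k - h * grad(x_k) + sqrt(2 * h * T_k) * eps_k
    with the cooling schedule T_k = c / ln(2 + k)."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(steps):
        T_k = c / math.log(2 + k)           # temperature at step k
        noise = math.sqrt(2.0 * h * T_k)    # scale of the Gaussian perturbation
        g = grad(x)
        x = [xi - h * gi + noise * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
    return x
```

Setting `c = 0` removes the noise and recovers plain gradient descent, which is a convenient sanity check of the drift term.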
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Higher Order Embeddings Mm) 
for the Composition of the Harmonic
Projection and Homotopy Operators 


Shusen Ding, Guannan Shi, and Donna Sylvester 


Abstract In this chapter, the higher order embedding estimates for the composition 
of the homotopy and harmonic projection operators on differential forms are 
constructed, the higher regularity of this composition is discussed, and some 
applications of the main results are presented. 


1 Introduction 


Differential forms play an important role in many fields, especially in the solution 
of nonlinear partial differential equations. This is due to their advantage of being 
coordinate system independent, see [1, 4, 14] and other references. In this chapter, 
our purpose is to study the higher regularity properties of the composition of the 
homotopy operator T and harmonic projection operator H applied to differential 
forms, both of which are the key operators in exterior analysis and Hodge decompo- 
sition theory. From one point of view, in order to develop the fundamental properties 
of exterior differential forms rigorously, the homotopy operator T is regarded as the 
most powerful tool. The Poincaré lemma is one of the typical applications. On the 
other hand, the projection operator H defined on $L^p$-spaces of differential forms has nice properties. For example, the image of the projection operator H is also in the harmonic field $\mathcal{H}$, and its $L^p$-Hodge decomposition provides a technique to deal with the regularity theory for solutions to boundary value problems, see [9, 10, 15, 19], for example. Therefore, we are motivated to establish the higher order embedding estimates for the composite operator $T \circ H$ in terms of the $L^p$-norm for a differential form u. If $W^{1,p}$ is the standard Sobolev space for differential forms, we will obtain the following local and global estimates

$$\|TH(u) - (TH(u))_B\|_{W^{1,s}(B)} \le C\mu(B)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}, \tag{1}$$

and

$$\|TH(u) - (TH(u))_M\|_{W^{1,s}(M)} \le C\mu(M)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,M}, \tag{2}$$

for every $u \in L^p_{loc}(M, \Lambda^l)$, where $M \subset \mathbb{R}^n$ is a bounded domain, $\mu(M)$ is the Lebesgue measure of M, and $\sigma > 1$ is a constant such that $B \subset \sigma B \subset M$ for all balls B under consideration.

In previous work, see Chapter 9 in [1] and [7], the investigation has mainly focused on estimates for the $L^p$-norms, the Lipschitz norms, and the BMO norms for the composite operator $T \circ H$, as well as comparisons between these norms with the same integral exponent p ($1 < p < \infty$). For example, Poincaré-type and Sobolev embedding estimates for $\|TH(u)\|_{p,M}$ in terms of the norm $\|u\|_{p,M}$, for a smooth form u defined on the set $M \subset \mathbb{R}^n$, can be obtained from previous work. H. Bi et al. in [3] assert the higher embedding inequality of the homotopy operator T applied to the differential l-form u satisfying the following A-harmonic equation:

$$d^{\star} A(x, du) = 0 \tag{3}$$

where d is the exterior differential operator, $d^{\star}$ is the formal adjoint of d, $M \subset \mathbb{R}^n$ is a domain, and the operator $A : M \times \Lambda(\mathbb{R}^n) \to \Lambda(\mathbb{R}^n)$ satisfies the monotonicity condition:

$$|A(x, \xi)| \le a|\xi|^{p-1} \quad \text{and} \quad \langle A(x, \xi), \xi \rangle \ge |\xi|^{p} \tag{4}$$

for almost every $x \in M$ and all $\xi \in \Lambda(\mathbb{R}^n)$. Here, a > 0 is a constant and $1 < p < \infty$. However, embedding estimates for $\|TH(u)\|_{s,M}$ in terms of the norm $\|u\|_{p,M}$ when s > p have not been previously established. The $L^p$ regularity of operators, functions, and forms has been an important and active focus in analysis, see [2, 8, 18]. Here, we try to present an exhaustive study of the higher regularity and higher order embeddings of the composite operator $T \circ H$.

In this chapter, we proceed to develop new techniques and combine them with methods previously developed by others. We first derive local higher order Sobolev embedding inequalities for the composite operator $T \circ H$. See Theorems 1 and 2. Then, we establish the global higher integral properties. See Theorems 3 and 4, respectively. Finally, in Sect. 4, as applications of our main results, we obtain higher order Sobolev embedding inequalities for the homotopy operator T on any differential l-form $u \in W^{1,p}$. These results are generalizations of the results proved in [3].
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We use the following symbols and notation throughout this chapter. Let M Cc R” 
be a bounded domain, $n \ge 2$, and $B = B(x, \rho)$ be the ball in $\mathbb{R}^n$ with radius $\rho$ centered at x, which satisfies $\mathrm{diam}(\sigma B) = \sigma\,\mathrm{diam}(B)$.

For a bounded and convex domain D, the homotopy operator T is a bounded linear operator in $L^p$ with values in $W^{1,p}$, $1 < p < \infty$, and defined by

$$Tu(x; \xi) = \int_0^1 t^{l-1} \int_D \phi(y)\, u(tx + y - ty;\ x - y, \xi)\, dy\, dt \tag{5}$$

for all $x \in D$ and the vector $\xi = (\xi_1, \cdots, \xi_{l-1})$, $\xi_i \in \mathbb{R}^n$, $i = 1, \cdots, l-1$, where the function $\phi$ from $C_0^\infty(D)$ is normalized such that $\int_D \phi(y)\,dy = 1$. See [4] and [9] for details about T. In fact, the definition of T can be extended to any bounded domain, see [1]. This definition of T allows one to construct a new notation $u_D$ as follows:

$$u_D = \mu(D)^{-1}\int_D u(y)\,dy \ \ (l = 0), \qquad u_D = dTu \ \ (l = 1, 2, \cdots, n) \tag{6}$$

for any $u \in L^p(M, \Lambda^l)$, and $1 \le p < +\infty$. Specifically, when $1 < p < \infty$, one can obtain the estimate proven in [9]:

$$\|u_D\|_{p,D} \le A_n(p)\,\mu(D)\,\|u\|_{p,D} \tag{7}$$


Let $\Lambda = \Lambda(\mathbb{R}^n) = \bigoplus_{k=0}^{n} \Lambda^k(\mathbb{R}^n)$ be a graded algebra with respect to the exterior products, where the symbol $\Lambda^k = \Lambda^k(\mathbb{R}^n)$ denotes the set of all k-forms defined on $\mathbb{R}^n$ by

$$u(x) = \sum_{1 \le i_1 < \cdots < i_k \le n} u_{i_1 \cdots i_k}(x)\, dx_{i_1} \wedge \cdots \wedge dx_{i_k}.$$

We call u(x) a differentiable k-form if its coefficients $u_{i_1 \cdots i_k}$ are differentiable functions in $\mathbb{R}^n$. We use $D'(M, \Lambda^k)$ to denote the set of all differentiable k-forms defined in M. The symbol d is the exterior differential operator from $D'(M, \Lambda^k)$ to $D'(M, \Lambda^{k+1})$, and $d^{\star} = (-1)^{nk+1} \star d \star : D'(M, \Lambda^{k+1}) \to D'(M, \Lambda^k)$ is the formal adjoint of d, $0 \le k \le n-1$, see [15] and [20] for more descriptions. We use $L^p(M, \Lambda^k)$ to denote the classical $L^p$-space for differential forms, $1 < p < \infty$, equipped with the norm

$$\|u\|_{p,M} = \left(\int_M |u|^p\,dx\right)^{1/p} = \left(\int_M \Bigl(\sum_I |u_I(x)|^2\Bigr)^{p/2} dx\right)^{1/p},$$

and $W^{1,p}(M, \Lambda^k)$ is the classical Sobolev space for differential forms, endowed with the norm

$$\|u\|_{W^{1,p}(M)} = (\mathrm{diam}(M))^{-1}\|u\|_{p,M} + \|\nabla u\|_{p,M}. \tag{8}$$

Analogous to the $L^p$-space, $W^{1,p}(M, \Lambda^k)$ is a Banach space when $1 < p < n$. For appropriate properties, see [12] and [17]. We will use the following Dirac-harmonic equation occasionally in this chapter, which was initially introduced by S. Ding and B. Liu in 2015 in [5]:

$$d^{\star} A(x, Du) = 0 \tag{9}$$

for almost every $x \in M$, where $D = d + d^{\star}$ is the Dirac operator, $M \subset \mathbb{R}^n$ is a domain, and the operator $A : M \times \Lambda(\mathbb{R}^n) \to \Lambda(\mathbb{R}^n)$ satisfies the restrictions listed in (4). We should point out that the Dirac-harmonic equation is a very general equation that includes many existing harmonic equations as special cases, such as the A-harmonic equation (3), see [10] for more information.


2 Preliminary Results 


In this section, in preparation for proving the main results, we present some useful lemmas and properties. The constant C(n, p) employed below is independent of the differential form u and may be larger from one inequality to the next. We start with the definition of H(u), the harmonic projection acting on u.

For each fixed $k = 1, 2, \cdots, n$, $\mathcal{H}$ denotes the harmonic k-field, i.e.,

$$\mathcal{H} = \{u \in \mathcal{W}(M, \Lambda^k) : du = d^{\star}u = 0,\ u \in L^p \text{ for some } 1 < p < \infty\}.$$

The definition of the harmonic projection on an $L^p$-space follows that given by C. Scott in [15].

Definition 1 Given $u \in L^p(M, \Lambda^k)$, the projection operator $H : L^p \to \mathcal{H}$ is defined by letting H(u) be the unique element of $\mathcal{H}$ satisfying Poisson's equation

$$H(u) = u - \Delta G(u), \tag{10}$$

where G is the Green's operator and $\Delta = dd^{\star} + d^{\star}d$ is the Laplace operator.

From [15], H(u) is normally seen as the harmonic projection of u or the harmonic part of u. The projection operator H is a bounded linear operator on $L^p(M, \Lambda^k)$, that is, for any $u \in L^p(M, \Lambda^k)$,

$$\|H(u)\|_{p,M} \le C\|u\|_{p,M}. \tag{11}$$

Remark 1 In view of Definition 1, we easily obtain that, for any $u \in L^p$, $H(u) = (H(u))_B$ is a closed form satisfying the harmonic equation (9).

From [4] and [9], there are three valuable properties of the homotopy operator T, which will be used repeatedly in this chapter.




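Lemma 1 Suppose $u \in L^p_{loc}(\Omega, \Lambda^l)$ is such that $du \in L^p(\Omega, \Lambda^{l+1})$, $1 \le p < \infty$, $l = 1, \cdots, n-1$. Then d(Tu) is a regular distribution of class $L^p_{loc}(\Omega, \Lambda^l)$, and

$$u = T(du) + d(Tu) \tag{12}$$

where $\Omega \subset \mathbb{R}^n$ is a bounded convex domain.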
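In particular, for the case $1 < p < \infty$, it has been proven that:

Lemma 2 For each p > 1, the integral (5) defines a bounded operator

$$T : L^p(\Omega, \Lambda^l) \to W^{1,p}(\Omega, \Lambda^{l-1}), \quad l = 1, 2, \cdots, n$$

with the norm estimated by

$$\|Tu\|_{W^{1,p}(\Omega)} \le C(n, p)\,\mu(\Omega)\,\|u\|_{p,\Omega}. \tag{13}$$

The results below follow immediately from (13) and (8):

$$\|\nabla(Tu)\|_{p,\Omega} \le C\mu(\Omega)\|u\|_{p,\Omega} \quad \text{and} \quad \|Tu\|_{p,\Omega} \le C\mu(\Omega)\,\mathrm{diam}(\Omega)\,\|u\|_{p,\Omega}. \tag{14}$$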


Lemma 3 Let $u \in L^p(Q, \Lambda^l)$ be a differential form and $du \in L^p(Q, \Lambda^{l+1})$. Then, there exists a constant C > 0, independent of u, such that $u - u_Q$ is in $L^{np/(n-p)}(Q, \Lambda^l)$ and

$$\left(\int_Q |u - u_Q|^{np/(n-p)}\,dx\right)^{(n-p)/np} \le C\left(\int_Q |du|^p\,dx\right)^{1/p}$$

for a cube or a ball Q in $\mathbb{R}^n$, $l = 0, 1, \cdots, n-1$, and $1 < p < n$.


The following weak reverse Hölder inequality for solutions to the Dirac-harmonic equation (9) was proved in [5]; it generalizes the weak reverse Hölder inequality for solutions to the A-harmonic equation.

Lemma 4 Let u be a solution to the Dirac-harmonic equation (9) in M associated with p > 1, let $\sigma > 1$ be a constant, and let $0 < r, s < \infty$ be constants. Then, there exists a constant C > 0, independent of u, such that

$$\|u\|_{s,B} \le C\mu(B)^{(r-s)/(rs)}\|u\|_{r,\sigma B}$$

for all cubes or balls B with $\sigma B \subset M$.


With these facts in mind, along with useful observations, we have the following four results, Propositions 1 to 4.
Proposition 1 Assume that u is a differential l-form in $L^p_{loc}(M, \Lambda^l)$, $l = 1, 2, \cdots, n$, $1 < p < n$. Then, for any $0 < s < \frac{np}{n-p}$, there exists a constant C > 0, independent of u, such that

$$\|TH(u) - (TH(u))_B\|_{s,B} \le C\mu(B)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}, \tag{15}$$

where B is a ball with $\sigma B \subset M$, and $\sigma > 1$ is a constant.


Proof For any $u \in L^p_{loc}(M, \Lambda^l)$, one can readily see that $H(u) \in C^\infty(M, \Lambda^l)$. Observe that $(n-p)/np = 1/p - 1/n < 1/p$, so $1 < p < \frac{np}{n-p}$. Then, replacing u with TH(u) in the decomposition formula (12), we obtain

$$TH(u) = Td(TH(u)) + dT(TH(u)). \tag{16}$$

For $l = 1, \cdots, n$, by means of (6), we find that

$$dT(TH(u)) = (TH(u))_B \quad \text{and} \quad dT(H(u)) = (H(u))_B. \tag{17}$$

Combining (14), (16), and (17), we have

$$\|TH(u) - (TH(u))_B\|_{\frac{np}{n-p},B} = \|Td(TH(u))\|_{\frac{np}{n-p},B} \le C(n,p)\mu(B)\,\mathrm{diam}(B)\,\|dT(H(u))\|_{\frac{np}{n-p},B} \le C(n,p)\mu(B)^{1+1/n}\|(H(u))_B\|_{\frac{np}{n-p},B}. \tag{18}$$

From Definition 1, we easily conclude that $H(u) = (H(u))_B$ is a harmonic form, which implies that H(u) is a solution of the Dirac-harmonic equation (9). So, using (11) and Lemma 4, it follows that

$$\|(H(u))_B\|_{\frac{np}{n-p},B} = \|H(u)\|_{\frac{np}{n-p},B} \le C\mu(B)^{\frac{n-p}{np}-\frac{1}{p}}\|H(u)\|_{p,\sigma B} \le C(n,p)\mu(B)^{-1/n}\|u\|_{p,\sigma B}, \tag{19}$$

where $\sigma > 1$ is a constant. Substituting (19) into (18) yields

$$\|TH(u) - (TH(u))_B\|_{\frac{np}{n-p},B} \le C(n,p)\mu(B)\|u\|_{p,\sigma B}. \tag{20}$$

Moreover, for any $0 < s < \frac{np}{n-p}$, applying the monotonic property of the $L^p$-spaces, it follows that

$$\|TH(u) - (TH(u))_B\|_{s,B} \le \mu(B)^{\frac{1}{s}-\frac{n-p}{np}}\|TH(u) - (TH(u))_B\|_{\frac{np}{n-p},B}. \tag{21}$$

Combining (20) and (21), we finally have

$$\|TH(u) - (TH(u))_B\|_{s,B} \le C(n,p)\mu(B)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}$$

as desired.
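
For readers who want to check the powers of $\mu(B)$ in the proof of Proposition 1, the bookkeeping behind (19), (20), and (21) can be written out as follows; this only expands the arithmetic and assumes the exponents as read from (15) and (18).

```latex
% Exponent in (19): Lemma 4 with s = np/(n-p) and r = p
\frac{n-p}{np} - \frac{1}{p} = \Bigl(\frac{1}{p} - \frac{1}{n}\Bigr) - \frac{1}{p} = -\frac{1}{n},
% Exponent in (20): product of the powers in (18) and (19)
\qquad \Bigl(1 + \frac{1}{n}\Bigr) - \frac{1}{n} = 1,
% Final exponent: power in (20) plus the power in (21)
\qquad 1 + \Bigl(\frac{1}{s} - \frac{n-p}{np}\Bigr) = 1 + \frac{1}{n} + \frac{1}{s} - \frac{1}{p}.
```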


Additionally, by a simple calculation, one can obtain the integral average inequality from Proposition 1. That is,

$$\left(\frac{1}{\mu(B)}\int_B |TH(u) - (TH(u))_B|^s\,dx\right)^{1/s} \le C\mu(B)^{1+\frac{1}{n}}\left(\frac{1}{\mu(B)}\int_{\sigma B} |u|^p\,dx\right)^{1/p}.$$

For the next case $p \ge n$, we will assert that this result also holds.


Proposition 2 Assume that u is a differential l-form in $L^p_{loc}(M, \Lambda^l)$, $l = 1, 2, \cdots, n$, $p \ge n$. Then, for any real number s > 0, there is a constant C > 0, independent of u, such that

$$\|TH(u) - (TH(u))_B\|_{s,B} \le C\mu(B)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}, \tag{22}$$

where B is a ball with $\sigma B \subset M$, and $\sigma > 1$ is a constant.


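Proof Take $\theta = \max\{1, s/p\}$ and $q = \frac{n\theta p}{n + \theta p}$. It is easy to see that $1 < q < n$ and $q < p$ because $p \ge n$ and

$$q - p = \frac{n\theta p}{n + \theta p} - p = \frac{p(\theta(n-p) - n)}{n + \theta p} < 0.$$

Using the same technique as in the proof of Proposition 1, we have

$$\|TH(u) - (TH(u))_B\|_{\frac{nq}{n-q},B} = \|Td(TH(u))\|_{\frac{nq}{n-q},B} \le C(n,q)\mu(B)\,\mathrm{diam}(B)\,\|dT(H(u))\|_{\frac{nq}{n-q},B} \le C(n,q)\mu(B)^{1+\frac{1}{n}}\|(H(u))_B\|_{\frac{nq}{n-q},B} = C(n,q)\mu(B)^{1+\frac{1}{n}}\|H(u)\|_{\theta p,B} \le C(n,q)\mu(B)^{1+\frac{1}{n}}\mu(B)^{\frac{1}{\theta p}-\frac{1}{p}}\|H(u)\|_{p,\sigma B} \le C(n,q)\mu(B)^{1+\frac{1}{n}+\frac{1}{\theta p}-\frac{1}{p}}\|u\|_{p,\sigma B}. \tag{23}$$

Notice that $\frac{nq}{n-q} = \theta p \ge s$, because $q = \frac{n\theta p}{n + \theta p}$ and $\theta \ge 1$. Once again, according to the monotonic property of the $L^p$-spaces, we have

$$\|TH(u) - (TH(u))_B\|_{s,B} \le \mu(B)^{\frac{1}{s}-\frac{1}{\theta p}}\|TH(u) - (TH(u))_B\|_{\theta p,B}. \tag{24}$$

Further, substituting (23) into (24) implies

$$\|TH(u) - (TH(u))_B\|_{s,B} \le C(n,p)\mu(B)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}.$$

The proof of Proposition 2 is complete.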
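
The two identities about the auxiliary exponent q used in the proof of Proposition 2 can be checked directly. The computation below assumes the choice $q = n\theta p/(n + \theta p)$ with $\theta = \max\{1, s/p\}$, as read from the proof.

```latex
% The Sobolev conjugate of q equals \theta p:
\frac{nq}{n-q}
  = \frac{n \cdot \frac{n\theta p}{n+\theta p}}{\,n - \frac{n\theta p}{n+\theta p}\,}
  = \frac{n^2 \theta p}{n(n+\theta p) - n\theta p}
  = \frac{n^2 \theta p}{n^2}
  = \theta p.
% q < n always, and q < p whenever p \ge n and \theta \ge 1:
q - n = \frac{n\theta p - n(n+\theta p)}{n+\theta p} = -\frac{n^2}{n+\theta p} < 0,
\qquad
q - p = \frac{p\bigl(\theta(n-p) - n\bigr)}{n+\theta p} \le -\frac{np}{n+\theta p} < 0.
```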


Before providing further results, we note the equivalence relationship between $\|u\|_{p,B}$ and $\|u - u_B\|_{p,B}$. That is, for any $u \in L^p(M, \Lambda^l)$, if the measure $\mu(\{x \in B : |u - u_B| > 0\}) \ne 0$, there exist two non-negative constants $C_1$ and $C_2$ such that

$$C_1\|u - u_B\|_{p,B} \le \|u\|_{p,B} \le C_2\|u - u_B\|_{p,B}. \tag{25}$$

Next, we investigate the local higher regularity properties of the composite operator $T \circ H$.


Proposition 3 Assume that u is a differential form in $L^p_{loc}(M, \Lambda^l)$ with $l = 1, 2, \cdots, n$, $1 < p < n$. Then, for any $s \in (0, np/(n-p))$, there is a constant C > 0, independent of u, such that

$$\|TH(u)\|_{s,B} \le C\mu(B)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}, \tag{26}$$

where B is a ball with $\sigma B \subset M$, and $\sigma > 1$ is a constant.


Proof Let $B \subset M$ be any ball with $\sigma B \subset M$, where $\sigma > 1$ is some constant. We prove this proposition in two steps: (i) We first assume that the measure $\mu\{x : |TH(u) - (TH(u))_B| > 0\} \ne 0$; then according to the inequality (20) in the proof of Proposition 1, we obtain

$$\|TH(u) - (TH(u))_B\|_{\frac{np}{n-p},B} \le C(n,p)\mu(B)\|u\|_{p,\sigma B}. \tag{27}$$

From the above comment (25), we see that

$$\left(\int_B |v|^{\frac{np}{n-p}}\,dx\right)^{\frac{n-p}{np}} \le C(n,p)\left(\int_B |v - v_B|^{\frac{np}{n-p}}\,dx\right)^{\frac{n-p}{np}} \tag{28}$$

holds for any differential form v with $\mu\{x : |v - v_B| > 0\} \ne 0$. Replacing v by TH(u) in (28), we have

$$\left(\int_B |TH(u)|^{\frac{np}{n-p}}\,dx\right)^{\frac{n-p}{np}} \le C(n,p)\left(\int_B |TH(u) - (TH(u))_B|^{\frac{np}{n-p}}\,dx\right)^{\frac{n-p}{np}}. \tag{29}$$

Combining (27) and (29), we obtain

$$\|TH(u)\|_{\frac{np}{n-p},B} \le C(n,p)\mu(B)\|u\|_{p,\sigma B}. \tag{30}$$

Hence, for any s with $0 < s < \frac{np}{n-p}$, using the monotonicity property of the $L^p$-spaces, it follows that

$$\|TH(u)\|_{s,B} \le \mu(B)^{\frac{1}{s}-\frac{1}{p}+\frac{1}{n}}\|TH(u)\|_{\frac{np}{n-p},B} \le C(n,p)\mu(B)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}. \tag{31}$$

(ii) Now, we consider the case that $\mu\{x \in B : |TH(u) - (TH(u))_B| > 0\} = 0$. This means that $TH(u) = (TH(u))_B$ a.e. in B. Therefore, TH(u) is a closed form that is a solution to the A-harmonic equation. So we make use of (11), (14) and the weak reverse Hölder inequality for solutions to the A-harmonic equation to get

$$\|TH(u)\|_{t,B} \le C(n,p)\mu(B)^{\frac{1}{t}-\frac{1}{p}}\|TH(u)\|_{p,\sigma B} \le C(n,p)\mu(B)^{1+\frac{1}{t}-\frac{1}{p}}\,\mathrm{diam}(B)\,\|H(u)\|_{p,\sigma B} \le C(n,p)\mu(B)^{1+\frac{1}{n}+\frac{1}{t}-\frac{1}{p}}\|H(u)\|_{p,\sigma B} \le C(n,p)\mu(B)^{1+\frac{1}{n}+\frac{1}{t}-\frac{1}{p}}\|u\|_{p,\sigma B}$$

for any t > 0. Choosing $t = \frac{np}{n-p}$, we have

$$\|TH(u)\|_{\frac{np}{n-p},B} \le C(n,p)\mu(B)\|u\|_{p,\sigma B},$$

which is equivalent to (30). As in case (i), we may use the monotonicity property of the $L^p$-spaces to obtain the desired result for any $0 < s < np/(n-p)$. The proof of Proposition 3 is completed.


Analogous to the case $1 < p < n$, we have the corresponding result for the case $p \ge n$.

Proposition 4 Assume that $u \in L^p_{loc}(M, \Lambda^l)$ is a differential form with $l = 1, 2, \cdots, n$, $p \ge n$. Then, for any real number s > 0, there exists a constant C > 0 such that

$$\|TH(u)\|_{s,B} \le C\mu(B)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}, \tag{32}$$

where B is a ball with $\sigma B \subset M$, and $\sigma > 1$ is a constant.


Proof Similar to the proof of Proposition 3, this proof will be divided into two cases:

(i) In the case $p \le s < +\infty$, since H(u) is the harmonic part of u, that is, $dH(u) = d^{\star}H(u) = 0$ a.e., H(u) satisfies the Dirac-harmonic equation. Thus, by using (14) and Lemma 4, together with the boundedness of the projection operator H in (11), it is immediate to conclude that

$$\|TH(u)\|_{s,B} \le C(n,p)\mu(B)\,\mathrm{diam}(B)\,\|H(u)\|_{s,B} \le C(n,p)\mu(B)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|H(u)\|_{p,\sigma B} \le C(n,p)\mu(B)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B},$$

where $\sigma > 1$ is a constant.
(ii) For the case $0 < s < p$, the result is quickly obtained by employing the technique developed in case (i), which shows that

$$\|TH(u)\|_{p,B} \le C(n,p)\mu(B)\,\mathrm{diam}(B)\,\|H(u)\|_{p,B} \le C(n,p)\mu(B)^{1+\frac{1}{n}}\|u\|_{p,\sigma B}. \tag{33}$$

Then, we make use of the monotonicity property of the $L^p$-spaces to conclude that

$$\|TH(u)\|_{s,B} \le \mu(B)^{\frac{1}{s}-\frac{1}{p}}\|TH(u)\|_{p,B}. \tag{34}$$

Substituting (33) into (34) yields

$$\|TH(u)\|_{s,B} \le C(n,p)\mu(B)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}.$$

Thus, for any real number s > 0, the inequality (32) follows.


We finish this section with the following lemma that we will need in later proofs and the application section.

Lemma 5 [13] Each domain M has a modified Whitney cover of cubes $\mathcal{V} = \{Q_i\}$ such that

$$\bigcup_i Q_i = M, \qquad \sum_{Q_i \in \mathcal{V}} \chi_{\sqrt{5/4}\,Q_i} \le N\chi_M$$

for some N > 1, where $\chi$ is the characteristic function.


3 Main Results 


We begin this section with the local Sobolev embedding inequalities. First, we prove the following local higher order embedding inequalities for the composition of the homotopy operator T and the harmonic projection operator H.

Theorem 1 Let H be the projection operator and T be the homotopy operator. If the differential form u is in $L^p_{loc}(M, \Lambda^l)$, $l = 1, 2, \cdots, n$, $1 < p < n$, then, for any real number s with $0 < s < \frac{np}{n-p}$, there exists a constant C > 0, independent of u, such that

$$\|TH(u) - (TH(u))_B\|_{W^{1,s}(B)} \le C\mu(B)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}, \tag{35}$$

where B is a ball with $\sigma B \subset M$, and $\sigma > 1$ is a constant.


Proof First, assume that $1 < s < np/(n-p)$. Using the decomposition (12), we have

$$TH(u) - (TH(u))_B = Td(TH(u)).$$

According to the definition of the norm for $W^{1,p}(M, \Lambda^l)$ and inequality (13), this yields

$$\|TH(u) - (TH(u))_B\|_{W^{1,s}(B)} = \|Td(TH(u))\|_{W^{1,s}(B)} \le C(n,p)\mu(B)\|dT(H(u))\|_{s,B}. \tag{36}$$

Then, from the fact $(H(u))_B = H(u)$, it follows that

$$\|dT(H(u))\|_{s,B} = \|(H(u))_B\|_{s,B} = \|H(u)\|_{s,B}. \tag{37}$$

By Lemma 4 and (11), we have

$$\|H(u)\|_{s,B} \le C(n,p)\mu(B)^{\frac{1}{s}-\frac{1}{p}}\|H(u)\|_{p,\sigma B} \le C(n,p)\mu(B)^{\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B} \tag{38}$$

for some $\sigma > 1$ and any ball $B \subset M$. Combining (37) and (38), we have

$$\|dT(H(u))\|_{s,B} \le C(n,p)\mu(B)^{\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}. \tag{39}$$

Substituting (39) into (36) gives

$$\|TH(u) - (TH(u))_B\|_{W^{1,s}(B)} \le C(n,p)\mu(B)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}. \tag{40}$$

We have proved that (35) holds for $1 < s < np/(n-p)$. Next, for any t with $0 < t < s$, from the monotonicity property of the $L^p$-space, we obtain

$$\|TH(u) - (TH(u))_B\|_{t,B} \le \mu(B)^{\frac{1}{t}-\frac{1}{s}}\|TH(u) - (TH(u))_B\|_{s,B}, \tag{41}$$

and

$$\|\nabla(TH(u) - (TH(u))_B)\|_{t,B} \le \mu(B)^{\frac{1}{t}-\frac{1}{s}}\|\nabla(TH(u) - (TH(u))_B)\|_{s,B}. \tag{42}$$

From (40), (41), and (42) and using (8), we find that

$$\|TH(u) - (TH(u))_B\|_{W^{1,t}(B)} = (\mathrm{diam}(B))^{-1}\|TH(u) - (TH(u))_B\|_{t,B} + \|\nabla(TH(u) - (TH(u))_B)\|_{t,B} \le (\mathrm{diam}(B))^{-1}\mu(B)^{\frac{1}{t}-\frac{1}{s}}\|TH(u) - (TH(u))_B\|_{s,B} + \mu(B)^{\frac{1}{t}-\frac{1}{s}}\|\nabla(TH(u) - (TH(u))_B)\|_{s,B} = \mu(B)^{\frac{1}{t}-\frac{1}{s}}\|TH(u) - (TH(u))_B\|_{W^{1,s}(B)} \le C(n,p)\mu(B)^{1+\frac{1}{t}-\frac{1}{p}}\|u\|_{p,\sigma B}. \tag{43}$$

The inequalities (40) and (43) indicate that (35) holds for any s with $0 < s < np/(n-p)$. The proof of Theorem 1 is complete.


For the case $p \ge n$ and any s > 0, using a method similar to the proof of Theorem 1, we have the following higher order embedding inequality for the composite operator $T \circ H$.

Theorem 2 Let H be the projection operator and T be the homotopy operator. If the differential form u is in $L^p_{loc}(M, \Lambda^l)$, $l = 1, 2, \cdots, n$, $p \ge n$, then, for any real number s > 0, there exists a constant C > 0, independent of u, such that

$$\|TH(u) - (TH(u))_B\|_{W^{1,s}(B)} \le C\mu(B)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}, \tag{44}$$

where B is a ball with $\sigma B \subset M$, and $\sigma > 1$ is a constant.


Next, we assert the global higher embedding estimates of the composite operator $T \circ H$, corresponding to the local results in Theorems 1 and 2, respectively.

Theorem 3 Assume that the differential l-form u is in $L^p(M, \Lambda^l)$, H is the projection operator, and T is the homotopy operator, $l = 1, 2, \cdots, n$, $1 < p < n$. Then, $TH(u) \in W^{1,s}(M, \Lambda^{l-1})$ for any s with $0 < s < np/(n-p)$. Furthermore, there exists a constant C > 0, independent of u, such that

$$\|TH(u)\|_{W^{1,s}(M)} \le C\mu(M)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,M}, \tag{45}$$

and

$$\|TH(u) - (TH(u))_M\|_{W^{1,s}(M)} \le C\mu(M)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,M}, \tag{46}$$

where $M \subset \mathbb{R}^n$ is a bounded domain.
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Proof We first assume that 1 < s < np/(n— p). Using Proposition 3 and Lemma 5, 
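it follows that

$$\|TH(u)\|_{s,M} \le \sum_{B \in \mathcal{V}} \|TH(u)\|_{s,B} \le \sum_{B \in \mathcal{V}} \left(C(n,p)\mu(B)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}\right) \le C(n,p)\mu(M)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}} N\|u\|_{p,M} \le C(n,p)\mu(M)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,M}. \tag{47}$$

Similarly, from Lemma 5, (14), and (38), we have

$$\|\nabla TH(u)\|_{s,M} \le \sum_{B \in \mathcal{V}} \|\nabla TH(u)\|_{s,B} \le \sum_{B \in \mathcal{V}} C(n,p)\mu(B)\|H(u)\|_{s,B} \le \sum_{B \in \mathcal{V}} C(n,p)\mu(B)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B} \le C(n,p)\mu(M)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,M}. \tag{48}$$

Using (8), (47), and (48), and noticing that $\mathrm{diam}(M) = C(n,p)\mu(M)^{1/n}$, we find that

$$\|TH(u)\|_{W^{1,s}(M)} = \mathrm{diam}(M)^{-1}\|TH(u)\|_{s,M} + \|\nabla TH(u)\|_{s,M} \le C(n,p)\mu(M)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,M}. \tag{49}$$

We have proved that (45) holds for $1 < s < np/(n-p)$. Similar to the proof of Theorem 1, using the monotonicity property of the $L^p$-spaces, we see that (45) also holds for the case $0 < s \le 1$.

Using the fact that $TH(u) - (TH(u))_M = TdTH(u)$, Lemma 5, (8), (14), and (38), we can prove (46) similarly. The proof of Theorem 3 is completed.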
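
The step from (47) and (48) to (49) relies on the factor $\mathrm{diam}(M)^{-1}$ cancelling the extra power $\mu(M)^{1/n}$ in (47). Spelled out, and assuming the relation $\mathrm{diam}(M) = C(n,p)\mu(M)^{1/n}$ stated in the proof:

```latex
\operatorname{diam}(M)^{-1}\,\|TH(u)\|_{s,M}
  \le C\,\mu(M)^{-\frac{1}{n}}\;\mu(M)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\,\|u\|_{p,M}
  = C\,\mu(M)^{1+\frac{1}{s}-\frac{1}{p}}\,\|u\|_{p,M},
```

so both terms of the $W^{1,s}(M)$-norm in (8) carry the same power of $\mu(M)$.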


By analogy with Theorem 3, and using Proposition 4, the result also holds for the case $p \ge n$.

Theorem 4 Suppose that the differential l-form u is in $L^p(M, \Lambda^l)$, H is the projection operator, and T is the homotopy operator, $l = 1, 2, \cdots, n$, $p \ge n$. Then, $TH(u) \in W^{1,s}(M, \Lambda^{l-1})$ for any s > 0. Furthermore, there exists a constant C > 0, independent of u, such that

$$\|TH(u)\|_{W^{1,s}(M)} \le C\mu(M)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,M}, \tag{50}$$

and

$$\|TH(u) - (TH(u))_M\|_{W^{1,s}(M)} \le C\mu(M)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,M}, \tag{51}$$

where $M \subset \mathbb{R}^n$ is a bounded domain.


Remark 2 From Theorems 3 and 4, one can obtain two global $L^p$-results corresponding to Propositions 3 and 4, respectively. Namely, if $1 < p < n$, for any $0 < s < np/(n-p)$, we have $TH(u) \in L^s(M, \Lambda^{l-1})$. Moreover, there exists a constant C > 0, independent of u, such that

$$\|TH(u)\|_{s,M} \le C\mu(M)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,M}.$$

When $p \ge n$, the above estimate also holds for any s > 0.


4 Applications 


In this section, we explore some immediate applications of the higher order embeddings of the composition of the homotopy operator T and projection operator H. The higher order embeddings and the higher regularity of the homotopy operator T that we obtain in this section are extensions of the results proved in [3]. For the upcoming applications, we note that the differential form H(u) is of class $C^\infty(M, \Lambda^l)$ for any differential l-form $u \in L^p(M, \Lambda^l)$, $1 < p < \infty$, $l = 1, 2, \cdots, n$. Because any differential form $u \in \mathcal{H}$ satisfies u = H(u), two simple direct consequences of Theorems 1 and 3 are

$$\|T(u) - (T(u))_B\|_{W^{1,s}(B)} = \|TH(u) - (TH(u))_B\|_{W^{1,s}(B)} \le C\mu(B)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B} \tag{52}$$

and

$$\|T(u) - (T(u))_M\|_{W^{1,s}(M)} = \|TH(u) - (TH(u))_M\|_{W^{1,s}(M)} \le C\mu(M)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,M}, \tag{53}$$

for any $0 < s < np/(n-p)$ and $1 < p < n$, where $\sigma > 1$ is a constant such that $B \subset \sigma B \subset M$. Moreover, if the differential l-form $u \in W^{1,p}_{loc}(M, \Lambda^l)$ is, for example, a solution of the Dirac-harmonic equation, it suffices to prove the higher order embeddings of the homotopy operator T. Namely, if $\mu\{x : |Tu - (Tu)_B| > 0\} \ne 0$, from (25), (14) and Lemma 4, one may derive that

$$\|Tu - (Tu)_B\|_{s,B} \le C(n,p)\|Tu\|_{s,B} \le C(n,p)\,\mathrm{diam}(B)\,\mu(B)\,\|u\|_{s,B} \le C(n,p)\mu(B)^{1+\frac{1}{n}}\|u\|_{s,B} \le C(n,p)\mu(B)^{1+\frac{1}{n}+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B} \tag{54}$$

holds for any s > 1 and p > 1, where $\sigma > 1$ is a constant, and $B \subset \sigma B \subset M$. By the monotonicity property of the $L^p$-spaces, inequality (54) also holds for s > 0. In the case of $\mu\{x : |Tu - (Tu)_B| > 0\} = 0$, i.e., $Tu = (Tu)_B$ a.e. in B, the inequality (54) still holds since $Tu - (Tu)_B = 0$ a.e. for all balls B with $\sigma B \subset M$.

With these observations in hand, we can generalize the previous arguments, i.e., (52), (53) and (54), to the classical $L^p$-spaces.

Corollary 1 Let T be the homotopy operator and $u \in L^p_{loc}(M, \Lambda^l)$, $l = 1, 2, \cdots, n$, $1 < p < n$. Then, for any real number $0 < s < \frac{np}{n-p}$, there exists a constant C > 0, independent of u, such that

$$\|Tu - (Tu)_B\|_{W^{1,s}(B)} \le C\mu(B)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}, \tag{55}$$

where $\sigma > 1$ is a constant and B ranges over all balls with $B \subset \sigma B \subset M$.


Proof For the case 1 < s < np/(n — p), from the “orthogonal” decomposition 
described in Definition 1, any u € L*, (M, A!) can be decomposed into 


loc 
u(x) = AG(u) + Hu). 
Then, by the linear property of the homotopy operator T, we have 
Tu=ToAoG(tu)+ToH(u) (56) 
and 
(Tu)p = (To QoGu))g + (To Hu))z. (57) 


Combining (56) and (57) with the triangle inequality of the norm, we obtain 


||Tu = (Tu) Bllwiscpy 
< || TAG) — TAGW))allyis(py + ITH) — THU ))allwiscay 68) 
As a consequence of Theorem 1, we see that

$$\|TH(u) - (TH(u))_B\|_{W^{1,s}(B)} \le C(n,p)\,\mu(B)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}. \tag{59}$$


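So, the key point in the following is to estimate the first term on the right-hand side
of (58). By the decomposition (12), $T \circ \Delta \circ G(u)$ can be expressed as

$$T \circ \Delta \circ G(u) = Td(T \circ \Delta \circ G(u)) + dT(T \circ \Delta \circ G(u)). \tag{60}$$

Using the norm definition (8) and the inequalities (7) and (13), it follows that

$$\begin{aligned}
\|T \circ \Delta \circ G(u) - (T \circ \Delta \circ G(u))_B\|_{W^{1,s}(B)}
&= \|Td(T \circ \Delta \circ G(u))\|_{W^{1,s}(B)} \\
&\le C(n,p)\,\mu(B)\|dT(\Delta \circ G(u))\|_{s,B} \\
&= C(n,p)\,\mu(B)\|(\Delta \circ G(u))_B\|_{s,B} \\
&\le C(n,p)\,\mu(B)^2\|\Delta \circ G(u)\|_{s,B} \\
&\le C(n,p)\,\mu(M)\mu(B)\|\Delta \circ G(u)\|_{s,B} \\
&\le C(n,p)\,\mu(B)\|\Delta \circ G(u)\|_{s,B}.
\end{aligned} \tag{61}$$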
A consequence of Theorem 2.8 in [6] is that

$$\|\Delta \circ G(u)\|_{t,B} = \|D^2 G(u)\|_{t,B} \le C(n,p)\,\mu(B)^{\frac{1}{t}-\frac{1}{p}}\|u\|_{p,\sigma B} \tag{62}$$

for any $0 < t < np/(n-p)$, where $\sigma > 1$ is a constant, and all balls $B \subset \sigma B \subset M$.
Then, letting $1 < t = s < np/(n-p)$, and substituting (62) into (61), we obtain


$$\begin{aligned}
\|T \circ \Delta \circ G(u) - (T \circ \Delta \circ G(u))_B\|_{W^{1,s}(B)}
&\le C(n,p)\,\mu(B)\,\mu(B)^{\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B} \\
&= C(n,p)\,\mu(B)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B}.
\end{aligned} \tag{63}$$

Substituting (63) and (59) into (58) gives (55). For the case $0 < t < s$, using the
monotonicity property of $L^p$-spaces, it is easy to prove that (55) still holds. The
proof of Corollary 1 is now completed.
Remark 3 The above local Sobolev embedding inequality is only valid for $1 < p < n$.
In fact, if we repeat the argument in Corollary 1 and use Theorem 2, we obtain
the following embedding for the case $p \ge n$ as well.


Corollary 2 Let $T$ be the homotopy operator and $u \in L^p_{\mathrm{loc}}(M, \Lambda^l)$, $l = 1, 2, \dots, n$,
$p \ge n$. Then, for any real number $s > 0$, there exists a constant $C > 0$,
independent of $u$, such that

$$\|Tu - (Tu)_B\|_{W^{1,s}(B)} \le C\mu(B)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,\sigma B},$$

where $\sigma > 1$ is a constant, for all balls $B$ with $B \subset \sigma B \subset M$.


Based on these local results, it is natural to consider global estimates. We 
essentially claim that the global Sobolev embedding inequality is simple to confirm 
by using the Whitney Covering Lemma 5. 


Corollary 3 Suppose that the differential $l$-form $u$ is in the space $L^p(M, \Lambda^l)$, $1 < p < n$,
$l = 1, 2, \dots, n$, and $T$ denotes the homotopy operator. Then, for any real number
$s \in \left(0, \frac{np}{n-p}\right)$, there exists a constant $C > 0$, independent of $u$, such that

$$\|Tu - (Tu)_M\|_{W^{1,s}(M)} \le C\mu(M)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,M}.$$


For the case $p \ge n$, we can also obtain the following analogous result.


Corollary 4 Suppose that the differential $l$-form $u$ is in the space $L^p(M, \Lambda^l)$, $p \ge n$,
$l = 1, 2, \dots, n$, and $T$ denotes the homotopy operator. Then, for any real number $s > 0$,
there exists a constant $C > 0$, independent of $u$, such that

$$\|Tu - (Tu)_M\|_{W^{1,s}(M)} \le C\mu(M)^{1+\frac{1}{s}-\frac{1}{p}}\|u\|_{p,M}.$$


In conclusion, we present a specific example for illustration. 


Example 1 Let $a = (a_1, \dots, a_n) \in \mathbb{R}^n$ be a fixed point and $\Omega = \{x =
(x_1, \dots, x_n) \in \mathbb{R}^n : 0 < |x - a| < r\}$. Assume that $u(x)$ is a differential 2-form on
$\Omega$ defined by


i 
ee er 


l<i<j<n 


dx; \ dx; 


where r € R is a positive number. 
Let $S_{n-1}(a) = \{(x_1, x_2, \dots, x_n) : (x_1 - a_1)^2 + (x_2 - a_2)^2 + \dots + (x_n - a_n)^2 = 1\}$
be the unit sphere in $\mathbb{R}^n$. Then, $\Omega = (0, r) \times S_{n-1}(a)$. Using the polar coordinates


in R” and noticing |u(x)| = eae = an we have 


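$$\begin{aligned}
\|u\|_{p,\Omega} &= \left(\int_\Omega |u(x)|^p\, dx\right)^{1/p} \\
&= \left(\int_0^r \left(\frac{1}{\rho^2}\right)^p \rho^{n-1}\, d\rho \int_{S_{n-1}} dS\right)^{1/p} \\
&= \left(\frac{1}{n-2p}\, r^{n-2p}\, \frac{2\pi^{n/2}}{\Gamma\left(\frac{n}{2}\right)}\right)^{1/p}
\end{aligned}$$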


for any $p$ with $0 < p < n/2$. Here, we have used the fact that the surface area of $S_{n-1}$ is
$2\pi^{n/2}/\Gamma\left(\frac{n}{2}\right)$. Hence, $u \in L^p(\Omega, \Lambda^2)$ for any $1 < p < n/2$. But $u \notin L^p(\Omega, \Lambda^2)$
for any $p \ge n/2$. Applying Theorem 3, we have $TH(u) \in L^s(\Omega, \Lambda^1)$ for any $s$
with $0 < s < np/(n-p)$. In the meantime, it would be very hard to evaluate the integrals
$\left(\int_\Omega |TH(u) - (TH(u))_\Omega|^s\, dx\right)^{1/s}$ or $\left(\int_\Omega |TH(u)|^s\, dx\right)^{1/s}$ directly to get the upper
bound of the higher order. Notice that the volume of $\Omega$ is $\pi^{n/2} r^n / \Gamma\left(1 + \frac{n}{2}\right)$. Applying (45)
and (46), we can easily get the global upper bound for the composite operator $T \circ H$.

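For a concrete example, take $n = 4$ and $p = 3/2$, then $u \in L^{3/2}(\Omega, \Lambda^2)$. For
any $0 < s < np/(n-p) = 12/5$ and all balls $B \subset \Omega$, due to the argument in Example 1,
we derive that $TH(u) \in W^{1,s}(\Omega, \Lambda^1)$. Moreover, recall the property of the Gamma
function that

$$\Gamma(n+1) = n\Gamma(n), \quad \Gamma(n) = (n-1)!, \quad \Gamma(1) = 1, \quad \text{and} \quad \Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi}. \tag{64}$$

Then, it follows that

$$V_4(r) = \frac{\pi^2 r^4}{2}. \tag{65}$$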
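The closed-form constants used in this example are easy to check numerically. The following minimal sketch uses only `math.gamma` from the Python standard library to evaluate the volume formula $\pi^{n/2} r^n/\Gamma(1 + n/2)$, the surface area $2\pi^{n/2}/\Gamma(n/2)$, and the exponent bound $np/(n-p)$ for $n = 4$ and $p = 3/2$. The function names are illustrative and are not part of the chapter.

```python
import math


def ball_volume(n: int, r: float) -> float:
    """Volume of the n-dimensional ball of radius r: pi^(n/2) r^n / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) * r ** n / math.gamma(n / 2 + 1)


def sphere_area(n: int) -> float:
    """Surface area of the unit sphere S_{n-1} in R^n: 2 pi^(n/2) / Gamma(n/2)."""
    return 2 * math.pi ** (n / 2) / math.gamma(n / 2)


def sobolev_exponent(n: int, p: float) -> float:
    """Upper limit np/(n - p) for the exponent s, meaningful for 1 < p < n."""
    return n * p / (n - p)


if __name__ == "__main__":
    n, p, r = 4, 1.5, 2.0
    # For n = 4 the general volume formula should reduce to pi^2 r^4 / 2, as in (65).
    print(ball_volume(n, r), math.pi ** 2 * r ** 4 / 2)
    print(sphere_area(n), 2 * math.pi ** 2)
    print(sobolev_exponent(n, p))
```

Since $\Gamma(3) = 2$ and $\Gamma(2) = 1$, the first two printed pairs should agree up to rounding, and the last value is the bound $12/5$ on $s$.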


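Thus, by taking (65) into (45) and (46), respectively, we attain the following
inequalities for $TH(u)$:

$$\|TH(u)\|_{W^{1,s}(\Omega)} \le C_1\left(\frac{\pi^2 r^4}{2}\right)^{\frac{1}{3}+\frac{1}{s}}\|u\|_{3/2,\Omega} \le C_2\, r^{2+4/s}$$

and

$$\|TH(u) - (TH(u))_\Omega\|_{W^{1,s}(\Omega)} \le C_3\left(\frac{\pi^2 r^4}{2}\right)^{\frac{1}{3}+\frac{1}{s}}\|u\|_{3/2,\Omega} \le C_4\, r^{2+4/s}.$$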


Remark 4 (i) Our global results can be extended to larger classes of domains, such
as $L^p$-averaging domains and $L^\varphi(\mu)$-averaging domains, see [1] and [16]. (ii) The
conditions imposed on $u$ in our embedding theorems are much weaker than those
in other embedding theorems proved in the existing papers, see [3] and [11]. In
particular, in our theorems, we do not require that the differential form $u$ satisfies
any version of the harmonic equations.
Codifferentials and Quasidifferentials
of the Expectation of Nonsmooth
Random Integrands and Two-Stage
Stochastic Programming


M. V. Dolgopolik 


Abstract This work is devoted to an analysis of exact penalty functions and optimality
conditions for nonsmooth two-stage stochastic programming problems. To
this end, we first study the codifferentiability and quasidifferentiability of the
expectation of nonsmooth random integrands and obtain explicit formulae for its
codifferential and quasidifferential under some natural assumptions on the integrand.
Then, we analyse exact penalty functions for a variational reformulation of two-stage
stochastic programming problems and obtain sufficient conditions for the global
exactness of these functions with two different penalty terms. At the end of the
chapter, we combine our results on the codifferentiability and quasidifferentiability
of the expectation of nonsmooth random integrands and on exact penalty functions
to derive optimality conditions for nonsmooth two-stage stochastic programming
problems in terms of codifferentials.


1 Introduction 


Two-stage stochastic programming is one of the basic problems of stochastic 
optimization [3, 40] that has multiple applications in various fields, including 
transportation planning [2, 31], disaster management [26], optimal design of energy 
systems [49], resources management [28], etc. Although two-stage stochastic 
programming problems can be viewed as stochastic versions of bilevel optimization 
problems [8, 9], their stochastic nature requires a largely different approach to 
their solution. Optimality conditions for two-stage stochastic programming prob- 
lems were obtained in [27, 35, 40, 45, 46], while numerical methods for solving 
various classes of two-stage stochastic programming problems were studied, e.g., in 
[24, 29, 33, 39] (see also the references therein). 
The need for computing convex or nonconvex subdifferentials of the expectation
of nonsmooth random integrands arises in many areas of stochastic optimiza- 
tion, including two-stage stochastic programming, as well as stochastic linear 
complementarity problems [6], stochastic variational inequalities [7], etc. The 
subdifferential in the sense of convex analysis of the expectation of a convex 
integrand was computed in [36], while its approximations were discussed in 
[32]. Various approximations of the Clarke subdifferential of the expectation of 
nonsmooth random integrands were studied in [5, 47], while an outer estimate of its 
Mordukhovich basic subdifferential was obtained in [46]. Finally, a quasidifferential 
of the expectation of quasidifferentiable random integrands was computed in [30]. 

The main goal of this chapter is to apply constructive nonsmooth analysis [12, 
14, 15] to a theoretical analysis of nonsmooth two-stage stochastic programming 
problems. Firstly, we analyse the codifferentiability and quasidifferentiability of 
the expectation of nonsmooth random integrands and present explicit formulae for 
its codifferential and quasidifferential in the more general case and under different 
assumptions than in [30] (see Remark 2 for more details). 

In the second part of the chapter, we study exact penalty functions for two-stage 
stochastic programming problems, reformulated as equivalent variational problems 
with pointwise constraints. With the use of the general theory of exact penalty 
functions [11, 16, 20, 23, 38, 48], we obtain sufficient conditions for the global 
exactness of penalty functions for two-stage stochastic programming with two 
different types of penalty terms. The use of penalty terms of the first type leads to 
much less restrictive assumptions on constraints of the second-stage problem, while 
the second type of penalty terms is more convenient for applications. In particular, 
it allows one to reformulate two-stage stochastic programming problems, whose 
second stage problem has DC (Difference-of-Convex) objective function and DC 
constraints, as equivalent unconstrained DC optimization problems and apply the 
well-developed apparatus of DC optimization to find their solutions (cf. analogous 
results for bilevel programming problems in [34, 42]). Let us also note that exact 
penalty functions for single-stage stochastic programming were analysed in [25]. 

Finally, at the end of the chapter, we combine our results on quasidifferentials
of the expectation of nonsmooth random integrands and exact penalty functions 
for two-stage stochastic programming problems to obtain necessary optimality 
conditions for these problems in terms of codifferentials. 

The chapter is organized as follows. Some auxiliary definitions and facts 
from constructive nonsmooth analysis, which are necessary for understanding the 
chapter, are collected in Sect. 2. Codifferentiability and quasidifferentiability of the 
expectation of nonsmooth random integrands are studied in Sect. 3, while Sect. 4 is 
devoted to nonsmooth two-stage stochastic programming problems. Exact penalty 
functions for such problems are analysed in Sect. 4.1, while optimality conditions 
for these problems in terms of codifferentials are derived in Sect. 4.2. 
2 Preliminaries


Let us introduce the notation and briefly recall several definitions from nonsmooth 
analysis that will be used throughout the article. For more details in the finite 
dimensional case, see [12, 14, 15]. The infinite dimensional case was studied in 
[17-19, 21]. 

Let $X$ be a real Banach space. Denote by $X^*$ its topological dual and by $\langle \cdot, \cdot \rangle$ the
duality pairing between $X$ and $X^*$. The weak* topology on $X^*$ is denoted by $w^*$ or
$\sigma(X^*, X)$ depending on the context. Denote also by $\tau_{\mathbb{R}}$ the canonical topology of
the real line $\mathbb{R}$. Let finally $U \subset X$ be an open set.


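Definition 1 A function $f\colon U \to \mathbb{R}$ is called codifferentiable at a point $x \in U$, if
there exists a pair of convex subsets $\underline{d}f(x), \overline{d}f(x) \subset \mathbb{R} \times X^*$ that are compact in
the topological product $(\mathbb{R} \times X^*, \tau_{\mathbb{R}} \times w^*)$, satisfy the equality

$$\max_{(a,x^*) \in \underline{d}f(x)} a = \min_{(b,y^*) \in \overline{d}f(x)} b = 0, \tag{1}$$

and for any $\Delta x \in X$ satisfy the following condition:

$$\lim_{\alpha \to +0} \frac{1}{\alpha}\Big| f(x + \alpha\Delta x) - f(x) - \max_{(a,x^*) \in \underline{d}f(x)} \big(a + \langle x^*, \alpha\Delta x\rangle\big) - \min_{(b,y^*) \in \overline{d}f(x)} \big(b + \langle y^*, \alpha\Delta x\rangle\big)\Big| = 0.$$

The pair $Df(x) = [\underline{d}f(x), \overline{d}f(x)]$ is called a codifferential of $f$ at $x$, the set
$\underline{d}f(x)$ is referred to as a hypodifferential of $f$ at $x$, while the set $\overline{d}f(x)$ is called
a hyperdifferential of $f$ at $x$.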


Remark 1 


(i) In the case when $X = \mathbb{R}^d$, a codifferential $Df(x)$ is a pair of convex compact
subsets of $\mathbb{R} \times \mathbb{R}^d = \mathbb{R}^{d+1}$ satisfying the equalities from the previous definition.
In addition, if $X$ is a Hilbert space, then it is natural to suppose that a
codifferential $Df(x)$ is a pair of convex weakly compact subsets of the space
$\mathbb{R} \times X$.

(ii) Note that a codifferential is not uniquely defined. In particular, one can easily
verify that for any compact convex subset $C$ of the space $(\mathbb{R} \times X^*, \tau_{\mathbb{R}} \times w^*)$,
the pair $[\underline{d}f(x) + C, \overline{d}f(x) - C]$ is a codifferential of $f$ at $x$ as well.
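As a small illustration of Definition 1 in the case $X = \mathbb{R}$, consider $f(x) = |x| = \max\{x, -x\}$. The sketch below assumes the hypodifferential $\underline{d}f(x) = \mathrm{co}\{(x - |x|, 1), (-x - |x|, -1)\}$ and the hyperdifferential $\overline{d}f(x) = \{(0, 0)\}$; this choice is an assumption made for the example, not a formula taken from the chapter. Since a linear function attains its maximum over a polytope at a vertex, it suffices to evaluate the max and min terms over the listed vertices. All function names are hypothetical.

```python
def hypodifferential_abs(x: float):
    """Vertices (a, v) of an assumed hypodifferential of f(x) = |x| at the point x."""
    return [(x - abs(x), 1.0), (-x - abs(x), -1.0)]


def hyperdifferential_abs(x: float):
    """Vertices (b, w) of the assumed (trivial) hyperdifferential of f(x) = |x|."""
    return [(0.0, 0.0)]


def codifferential_increment(x: float, dx: float) -> float:
    """Max term plus min term from Definition 1, evaluated at the increment dx."""
    max_term = max(a + v * dx for a, v in hypodifferential_abs(x))
    min_term = min(b + w * dx for b, w in hyperdifferential_abs(x))
    return max_term + min_term


if __name__ == "__main__":
    for x, dx in [(0.0, 0.5), (1.0, -3.0), (-2.0, 0.25)]:
        exact = abs(x + dx) - abs(x)
        print(x, dx, exact, codifferential_increment(x, dx))
```

For this particular function the approximation is exact for every increment, so the limit in Definition 1 holds trivially, and the normalisation (1) holds because one of the numbers $x - |x|$ and $-x - |x|$ is zero while the other is nonpositive.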


Definition 2 A function $f\colon U \to \mathbb{R}$ is called continuously codifferentiable at a
point $x \in U$, if $f$ is codifferentiable at every point in a neighbourhood of $x$ and there
exists a codifferential mapping $Df(\cdot) = [\underline{d}f(\cdot), \overline{d}f(\cdot)]$, defined in a neighbourhood
of $x$ and such that the multifunctions $\underline{d}f(\cdot)$ and $\overline{d}f(\cdot)$ are continuous in the Hausdorff
metric at $x$.
The class of functions that are continuously codifferentiable at a given point (or on a
given set) is closed under addition, multiplication, composition with continuously
differentiable functions, as well as pointwise maximum and minimum of finite
families of functions. Moreover, any convex function is continuously codifferentiable
in a neighbourhood of any given point from the interior of its effective domain, and 
any DC function (i.e., a function that can be represented as the difference of convex 
functions) is continuously codifferentiable in a neighbourhood of any given point. 
Numerous examples of continuously codifferentiable functions and main rules of 
codifferential calculus can be found in [12, 14, 15, 19, 21]. 


Definition 3 A function $f\colon U \to \mathbb{R}$ is called quasidifferentiable at a point $x \in U$,
if $f$ is directionally differentiable at $x$ and its directional derivative $f'(x, \cdot)$ at this
point can be represented as the difference of sublinear functions or, equivalently, if
there exists a pair $\underline{\partial}f(x), \overline{\partial}f(x) \subset X^*$ of convex weak* compact sets such that

$$f'(x, h) = \max_{x^* \in \underline{\partial}f(x)} \langle x^*, h\rangle + \min_{y^* \in \overline{\partial}f(x)} \langle y^*, h\rangle \quad \forall h \in X.$$

The pair $\mathscr{D}f(x) = [\underline{\partial}f(x), \overline{\partial}f(x)]$ is called a quasidifferential of $f$ at $x$, the set
$\underline{\partial}f(x)$ is called a subdifferential of $f$ at $x$, while the set $\overline{\partial}f(x)$ is referred to as a
superdifferential of $f$ at $x$.


Just like a codifferential, a quasidifferential is not uniquely defined. Here, we only
mention that a function $f$ is codifferentiable at a point $x$ iff $f$ is quasidifferentiable
at $x$, and one can easily compute a quasidifferential of $f$ at $x$ from its codifferential
at this point and vice versa. Namely, if $Df(x)$ is a codifferential of $f$ at $x$, then the
pair $\mathscr{D}f(x) = [\underline{\partial}f(x), \overline{\partial}f(x)]$ with

$$\underline{\partial}f(x) = \big\{x^* \in X^* \bigm| (0, x^*) \in \underline{d}f(x)\big\}, \quad \overline{\partial}f(x) = \big\{y^* \in X^* \bigm| (0, y^*) \in \overline{d}f(x)\big\} \tag{2}$$

is a quasidifferential of $f$ at $x$. Conversely, if $\mathscr{D}f(x)$ is a quasidifferential of $f$ at
$x$, then the pair $[\{0\} \times \underline{\partial}f(x), \{0\} \times \overline{\partial}f(x)]$ is a codifferential of $f$ at $x$ (see, e.g.,
[14, 21]). Below we consider only quasidifferentials of the form (2), that is, we
suppose that if a codifferentiable function $f$ and its codifferential $Df(x)$ are given,
then $\mathscr{D}f(x)$ is a quasidifferential of $f$ of the form (2).
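The passage from a codifferential to a quasidifferential via (2) can be sketched for polytopes in $\mathbb{R} \times \mathbb{R}$ given by finitely many vertices. Because every point of a hypodifferential has first coordinate $a \le 0$, the slice $\{a = 0\}$ of such a polytope is the convex hull of the vertices with $a = 0$ (and similarly for a hyperdifferential, where $b \ge 0$). The sketch below relies on this observation and reuses $f(x) = |x|$ with an assumed vertex representation of its hypodifferential; the helper names and the tolerance are hypothetical.

```python
def zero_slice(vertices, tol: float = 1e-12):
    """Second components of the vertices (a, v) whose first coordinate a is zero."""
    return [v for a, v in vertices if abs(a) <= tol]


def directional_derivative(sub, sup, h: float) -> float:
    """f'(x, h) = max over the subdifferential plus min over the superdifferential."""
    return max(v * h for v in sub) + min(w * h for w in sup)


def hypodifferential_abs(x: float):
    """Assumed hypodifferential vertices of f(x) = |x| at x."""
    return [(x - abs(x), 1.0), (-x - abs(x), -1.0)]


if __name__ == "__main__":
    hyper = [(0.0, 0.0)]  # trivial hyperdifferential of |x|
    for x in (0.0, 2.0, -2.0):
        sub = zero_slice(hypodifferential_abs(x))
        sup = zero_slice(hyper)
        print(x, sub, directional_derivative(sub, sup, 1.0))
```

At $x = 0$ both vertices survive and the subdifferential is the segment with endpoints $\pm 1$, so $f'(0, h) = |h|$; at $x \ne 0$ only one vertex has $a = 0$ and the directional derivative is linear in $h$.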

Let us finally recall one auxiliary definition from set-valued analysis that will
be used later (see, e.g., [1, Sect. 8.2] for more details). Let $X$ and $Y$ be metric
spaces and $(\Omega, \mathfrak{A}, \mu)$ be a measure space. A set-valued mapping $F\colon X \times \Omega \rightrightarrows Y$,
$F = F(x, \omega)$, is called a Carathéodory map, if for every $x \in X$ the multifunction
$F(x, \cdot)$ is measurable and for a.e. $\omega \in \Omega$ the multifunction $F(\cdot, \omega)$ is continuous.
3 Codifferentials of the Expectation of Nonsmooth Random
Integrands


Let $(\Omega, \mathfrak{A}, P)$ be a probability space, and suppose that a nonsmooth function
$f\colon \mathbb{R}^d \times \mathbb{R}^m \times \Omega \to \mathbb{R}$, $f = f(x, y, \omega)$, is given. In this section, we study the
codifferentiability of the nonsmooth integral functional

$$\mathcal{I}(x, y) = \mathbb{E}\big[f(x, y(\cdot), \cdot)\big] = \int_\Omega f(x, y(\omega), \omega)\, dP(\omega),$$

where $x \in \mathbb{R}^d$ is a parameter and $y \in L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$ with $1 < p \le +\infty$ is
an $m$-dimensional random vector. Although the case $p = 1$ can be included into
the general theory under some additional assumptions, we exclude it for the sake of
simplicity, since the proofs of the main results below are much more cumbersome
in the case $p = 1$ than in the case $1 < p \le +\infty$.

Denote by $p' \in [1, +\infty)$ the conjugate exponent of $p$, i.e., $1/p + 1/p' = 1$,
and let $|\cdot|$ be the Euclidean norm in $\mathbb{R}^n$. Let us impose some assumptions on the
integrand $f$ that, as we will show below, ensure that the functional $\mathcal{I}$ is correctly
defined and codifferentiable.

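Namely, we will suppose that for a.e. $\omega \in \Omega$ and for all $(x, y) \in \mathbb{R}^d \times \mathbb{R}^m$
the function $f$ is codifferentiable jointly in $x$ and $y$, that is, there exists a pair of
compact convex sets $\underline{d}_{x,y}f(x, y, \omega), \overline{d}_{x,y}f(x, y, \omega) \subset \mathbb{R} \times \mathbb{R}^d \times \mathbb{R}^m$ such that

$$\Phi_f(x, y, \omega; 0, 0) = \Psi_f(x, y, \omega; 0, 0) = 0,$$

and for all $(\Delta x, \Delta y) \in \mathbb{R}^d \times \mathbb{R}^m$, one has

$$\lim_{\alpha \to +0} \frac{1}{\alpha}\Big| f(x + \alpha\Delta x, y + \alpha\Delta y, \omega) - f(x, y, \omega) - \Phi_f(x, y, \omega; \alpha\Delta x, \alpha\Delta y) - \Psi_f(x, y, \omega; \alpha\Delta x, \alpha\Delta y)\Big| = 0,$$

where

$$\begin{aligned}
\Phi_f(x, y, \omega; \Delta x, \Delta y) &= \max_{(a, v_x, v_y) \in \underline{d}_{x,y}f(x, y, \omega)} \big(a + \langle v_x, \Delta x\rangle + \langle v_y, \Delta y\rangle\big), \\
\Psi_f(x, y, \omega; \Delta x, \Delta y) &= \min_{(b, w_x, w_y) \in \overline{d}_{x,y}f(x, y, \omega)} \big(b + \langle w_x, \Delta x\rangle + \langle w_y, \Delta y\rangle\big).
\end{aligned} \tag{3}$$

The pair $D_{x,y}f(x, y, \omega) = [\underline{d}_{x,y}f(x, y, \omega), \overline{d}_{x,y}f(x, y, \omega)]$ is called a codifferential
of $f$ in $(x, y)$.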
Assumption 1 The function $f$ satisfies the following conditions:

1. For any $x \in \mathbb{R}^d$, the map $(y, \omega) \mapsto f(x, y, \omega)$ is a Carathéodory function.

2. The function $f$ satisfies the following growth condition of order $p$: for any $N > 0$,
there exist $C_N > 0$ and a nonnegative function $\beta_N \in L^1(\Omega, \mathfrak{A}, P)$ such that
$|f(x, y, \omega)| \le \beta_N(\omega) + C_N|y|^p$ for all $x \in \mathbb{R}^d$ with $|x| \le N$, all $y \in \mathbb{R}^m$, and
a.e. $\omega \in \Omega$ in the case $1 < p < +\infty$, and $|f(x, y, \omega)| \le \beta_N(\omega)$ for a.e. $\omega \in \Omega$
and all $(x, y) \in \mathbb{R}^d \times \mathbb{R}^m$ with $\max\{|x|, |y|\} \le N$ in the case $p = +\infty$.

3. The multifunctions $(y, \omega) \mapsto \underline{d}_{x,y}f(x, y, \omega)$ and $(y, \omega) \mapsto \overline{d}_{x,y}f(x, y, \omega)$ are
Carathéodory maps for any $x \in \mathbb{R}^d$.

4. The codifferential mapping $D_{x,y}f(\cdot)$ satisfies the following growth condition of
order $p$: for any $N > 0$, there exist $C_N > 0$ and nonnegative functions $\beta_N \in
L^1(\Omega, \mathfrak{A}, P)$ and $\gamma_N \in L^{p'}(\Omega, \mathfrak{A}, P)$ such that

$$\max\{|a|, |v_x|\} \le \beta_N(\omega) + C_N|y|^p, \quad |v_y| \le \gamma_N(\omega) + C_N|y|^{p-1}$$

for all $(a, v_x, v_y) \in \underline{d}_{x,y}f(x, y, \omega) \cup \overline{d}_{x,y}f(x, y, \omega)$, all $x \in \mathbb{R}^d$ with $|x| \le N$,
all $y \in \mathbb{R}^m$, and a.e. $\omega \in \Omega$ in the case $1 < p < +\infty$, and

$$\max\{|a|, |v_x|, |v_y|\} \le \beta_N(\omega)$$

for all $(a, v_x, v_y) \in \underline{d}_{x,y}f(x, y, \omega) \cup \overline{d}_{x,y}f(x, y, \omega)$, a.e. $\omega \in \Omega$, and for all
vectors $(x, y) \in \mathbb{R}^d \times \mathbb{R}^m$ with $\max\{|x|, |y|\} \le N$ in the case $p = +\infty$.


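Note that the Carathéodory and the growth conditions on the function $f$ ensure
that the value $\mathcal{I}(x, y)$ is correctly defined and finite for all $x \in \mathbb{R}^d$ and $y \in
L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$. Let $X = \mathbb{R}^d \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$.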


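Theorem 1 Let $1 < p \le +\infty$ and Assumption 1 be valid. Then the functional $\mathcal{I}$
is codifferentiable on $\mathbb{R}^d \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$, and for any $(x, y)$ from this space, the
pair $D\mathcal{I}(x, y) = [\underline{d}\mathcal{I}(x, y), \overline{d}\mathcal{I}(x, y)]$, defined as

$$\begin{aligned}
\underline{d}\mathcal{I}(x, y) = \Big\{ (A, x^*) \in \mathbb{R} \times X^* \Bigm|\ & A = \mathbb{E}[a], \\
& \langle x^*, (h_x, h_y)\rangle = \langle \mathbb{E}[v_x], h_x\rangle + \int_\Omega \langle v_y(\omega), h_y(\omega)\rangle\, dP(\omega) \quad \forall (h_x, h_y) \in X, \\
& (a(\cdot), v_x(\cdot), v_y(\cdot)) \text{ is a measurable selection of the map } \underline{d}_{x,y}f(x, y(\cdot), \cdot) \Big\}
\end{aligned} \tag{4}$$

and

$$\begin{aligned}
\overline{d}\mathcal{I}(x, y) = \Big\{ (B, y^*) \in \mathbb{R} \times X^* \Bigm|\ & B = \mathbb{E}[b], \\
& \langle y^*, (h_x, h_y)\rangle = \langle \mathbb{E}[w_x], h_x\rangle + \int_\Omega \langle w_y(\omega), h_y(\omega)\rangle\, dP(\omega) \quad \forall (h_x, h_y) \in X, \\
& (b(\cdot), w_x(\cdot), w_y(\cdot)) \text{ is a measurable selection of the map } \overline{d}_{x,y}f(x, y(\cdot), \cdot) \Big\},
\end{aligned}$$

is a codifferential of $\mathcal{I}$ at $(x, y)$.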


The proof of Theorem 1 is similar to the proof of the codifferentiability of the
mapping $\mathcal{I}(u) = \int_\Omega f(x, u(x), \nabla u(x))\, dx$ from the author's papers [18, 22] (here,
$\Omega \subset \mathbb{R}^n$ is an open set and $u$ belongs to the Sobolev space). On the other hand,
Theorem 1 cannot be directly deduced from the main results of [18, 22]. That is why
below we present a detailed proof of Theorem 1. It seems possible to prove a more
general result on the codifferentiability of integral functionals defined on Banach
spaces that subsumes Theorem 1 and the main results of [18, 22] as particular cases.
The development of such a general theorem on the codifferentiability of nonsmooth
integral functionals is an interesting open problem for future research.

For the sake of convenience, we divide the proof of Theorem 1 into two lemmas.


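Lemma 1 Let $1 < p \le +\infty$ and Assumption 1 be valid. Then, for any $(x, y) \in X$,
the sets $\underline{d}\mathcal{I}(x, y)$ and $\overline{d}\mathcal{I}(x, y)$ from Theorem 1 are nonempty, convex, compact in
the topological product $(\mathbb{R} \times X^*, \tau_{\mathbb{R}} \times w^*)$ and satisfy the following equalities:

$$\max_{(A, x^*) \in \underline{d}\mathcal{I}(x, y)} A = \min_{(B, y^*) \in \overline{d}\mathcal{I}(x, y)} B = 0. \tag{5}$$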


Proof Fix any $(x, y) \in X$. We prove the statement of the lemma only for the
hypodifferential $\underline{d}\mathcal{I}(x, y)$, since the proof for the hyperdifferential $\overline{d}\mathcal{I}(x, y)$ is
exactly the same.

By Assumption 1, the multifunction $(y, \omega) \mapsto \underline{d}_{x,y}f(x, y, \omega)$ is a Carathéodory
map. Therefore, by Aubin and Frankowska [1, Thrm. 8.2.8], the multifunction
$\underline{d}_{x,y}f(x, y(\cdot), \cdot)$ is measurable, which by Aubin and Frankowska [1, Thrm. 8.1.3]
implies that there exists a measurable selection $(a(\cdot), v_x(\cdot), v_y(\cdot))$ of this mapping.
Furthermore, by the growth condition on the codifferential $D_{x,y}f(\cdot)$ from Assumption 1,
all measurable selections of the set-valued mapping $\underline{d}_{x,y}f(x, y(\cdot), \cdot)$ belong
to the space

$$Y := L^1(\Omega, \mathfrak{A}, P) \times L^1(\Omega, \mathfrak{A}, P; \mathbb{R}^d) \times L^{p'}(\Omega, \mathfrak{A}, P; \mathbb{R}^m). \tag{6}$$
Consequently, the linear functional $x^*$, defined as

$$\langle x^*, (h_x, h_y)\rangle = \langle \mathbb{E}[v_x], h_x\rangle + \int_\Omega \langle v_y(\omega), h_y(\omega)\rangle\, dP(\omega) \quad \forall (h_x, h_y) \in X,$$

belongs to $X^*$, and one can conclude that the hypodifferential $\underline{d}\mathcal{I}(x, y)$ is correctly
defined and nonempty.

Denote by $\mathcal{E}(x, y)$ the set of all measurable selections $z(\cdot) = (a(\cdot), v_x(\cdot), v_y(\cdot))$
of the set-valued mapping $\underline{d}_{x,y}f(x, y(\cdot), \cdot)$. As was noted above, $\mathcal{E}(x, y)$ is a subset
of the space $Y$ defined in (6). For any $z = (a, v_x, v_y) \in Y$, denote by $\mathcal{T}(z)$ the pair
$(A, x^*)$ defined as in (4). Then, $\underline{d}\mathcal{I}(x, y) = \mathcal{T}(\mathcal{E}(x, y))$.

By definition, for a.e. $\omega \in \Omega$, the hypodifferential $\underline{d}_{x,y}f(x, y(\omega), \omega)$ is a
convex set. Therefore, the set of measurable selections $\mathcal{E}(x, y)$ of the multifunction
$\underline{d}_{x,y}f(x, y(\cdot), \cdot)$ is convex. Hence, taking into account the fact that the operator $\mathcal{T}$
is linear, one obtains that the hypodifferential $\underline{d}\mathcal{I}(x, y)$ is a convex set as the image
of a convex set under a linear map.

Recall that by the definition of hypodifferential, one has $a \le 0$ for any
$(a, v_x, v_y) \in \underline{d}_{x,y}f(x, y(\omega), \omega)$, $\omega \in \Omega$. Therefore, $A \le 0$ for all $(A, x^*) \in
\underline{d}\mathcal{I}(x, y)$. On the other hand, observe that thanks to equality (1) for a.e. $\omega \in \Omega$,
one has

$$0 \in \big\{a \in \mathbb{R} \bigm| \exists (v_x, v_y) \in \mathbb{R}^{d+m}\colon (a, v_x, v_y) \in \underline{d}_{x,y}f(x, y(\omega), \omega)\big\}.$$

Hence, by the Filippov theorem (see, e.g., [1, Thrm. 8.2.10]), there exists a
measurable selection $(a_0(\cdot), v_{x0}(\cdot), v_{y0}(\cdot))$ of the set-valued map $\underline{d}_{x,y}f(x, y(\cdot), \cdot)$
such that $a_0(\omega) = 0$ almost surely. Consequently, for $(A_0, x_0^*) = \mathcal{T}(a_0, v_{x0}, v_{y0})$,
one has $A_0 = 0$, which implies that equality (5) holds true.

Thus, it remains to prove the compactness of the set $\underline{d}\mathcal{I}(x, y)$ in the corresponding
product topology. To this end, let us verify that the set $\mathcal{E}(x, y)$ is a weakly
compact subset of the space $Y$ defined in (6), and the operator $\mathcal{T}$ continuously
maps the space $Y$ endowed with the weak topology to the topological product
$(\mathbb{R}, \tau_{\mathbb{R}}) \times (X^*, w^*)$. Then, one can conclude that the hypodifferential $\underline{d}\mathcal{I}(x, y)$ is
compact in the corresponding product topology as a continuous image of a compact
set.

We start with the proof of the continuity of the operator $\mathcal{T}$. Let $\mathcal{V}$ be an open
subset of the product space $(\mathbb{R}, \tau_{\mathbb{R}}) \times (X^*, w^*)$. Let us show that its preimage $\mathcal{U} =
\mathcal{T}^{-1}(\mathcal{V})$ under the map $\mathcal{T}$ is weakly open in $Y$. Indeed, fix any $(a, v_x, v_y) \in \mathcal{U}$.
Then, $(A, x^*) = \mathcal{T}(a, v_x, v_y) \in \mathcal{V}$, which due to the openness of the set $\mathcal{V}$ in the
corresponding topology implies that there exist $\varepsilon > 0$, $n \in \mathbb{N}$, and pairs $(h_i, \xi_i) \in X$,
$i \in I = \{1, \dots, n\}$, such that

$$\mathcal{V}_\varepsilon(A, x^*) = \Big\{(B, y^*) \in \mathbb{R} \times X^* \Bigm| |B - A| < \varepsilon, \ \max_{i \in I} \big|\langle y^* - x^*, (h_i, \xi_i)\rangle\big| < \varepsilon\Big\} \subset \mathcal{V}.$$
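Introduce the set

$$\begin{aligned}
\mathcal{U}_\varepsilon(a, v_x, v_y) = \Big\{(b, w_x, w_y) \in Y \Bigm|\ & |\mathbb{E}(b - a)| < \varepsilon, \\
& \max_{i \in I} \Big|\int_\Omega \langle w_x(\omega) - v_x(\omega), h_i\rangle\, dP(\omega)\Big| < \frac{\varepsilon}{2}, \\
& \max_{i \in I} \Big|\int_\Omega \langle w_y(\omega) - v_y(\omega), \xi_i(\omega)\rangle\, dP(\omega)\Big| < \frac{\varepsilon}{2}\Big\}.
\end{aligned}$$

This set is a neighbourhood of the point $(a, v_x, v_y)$ in the weak topology on $Y$.
Moreover, by definition, $\mathcal{T}(\mathcal{U}_\varepsilon(a, v_x, v_y)) \subseteq \mathcal{V}_\varepsilon(A, x^*)$, which implies that
$\mathcal{U}_\varepsilon(a, v_x, v_y) \subseteq \mathcal{U}$. Thus, for any point $(a, v_x, v_y) \in \mathcal{U}$, there exists a
neighbourhood of this point in the weak topology contained in $\mathcal{U}$. In other words,
the set $\mathcal{U}$ is weakly open, and one can conclude that the operator $\mathcal{T}$ is continuous
with respect to the chosen topologies.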

Let us finally prove the weak compactness of the set $\mathcal{E}(x, y)$ in the space
$Y$ defined in (6). By the Eberlein-Smulian theorem, it suffices to prove that
$\mathcal{E}(x, y)$ is weakly sequentially compact. To this end, choose any sequence $z_n(\cdot) =
(a_n(\cdot), v_{xn}(\cdot), v_{yn}(\cdot)) \in \mathcal{E}(x, y)$, $n \in \mathbb{N}$. Let us consider two cases.

Case $p = +\infty$. By the growth condition on the codifferential $D_{x,y}f(\cdot)$ (see
Assumption 1), there exists an a.e. nonnegative function $\beta \in L^1(\Omega, \mathfrak{A}, P)$ such that
for a.e. $\omega \in \Omega$, one has

$$\max\{|a_n(\omega)|, |v_{xn}(\omega)|, |v_{yn}(\omega)|\} \le \beta(\omega) \quad \forall n \in \mathbb{N}.$$

Hence, by the weak compactness criterion in $L^1$ (see, e.g., [4, Thrm. 4.7.20]), the
closures of the sets $\{a_n\}_{n \in \mathbb{N}}$, $\{v_{xn}\}_{n \in \mathbb{N}}$, and $\{v_{yn}\}_{n \in \mathbb{N}}$ are weakly compact in the
corresponding $L^1$ spaces. Therefore, by the Eberlein-Smulian theorem, there exists
a subsequence $z_{n_k} = (a_{n_k}, v_{xn_k}, v_{yn_k})$ weakly converging to some $z_*$ in $Y$. By
Mazur's lemma, there exists a sequence of convex combinations $\{\widehat{z}_k\}$ of elements
of the sequence $z_{n_k}$ strongly converging to $z_*$. Therefore, as is well known, there
exists a subsequence $\{\widehat{z}_{k_l}\}$ converging to $z_*$ almost surely.

Note that due to the convexity of $\mathcal{E}(x, y)$, one has $\{\widehat{z}_k\} \subset \mathcal{E}(x, y)$, that is,
$\widehat{z}_k(\omega) \in \underline{d}_{x,y}f(x, y(\omega), \omega)$ for a.e. $\omega \in \Omega$ and all $k \in \mathbb{N}$. Hence, taking into
account the fact that by definition the hypodifferential $\underline{d}_{x,y}f(x, y(\omega), \omega)$, $\omega \in \Omega$,
is a closed set, one obtains that $z_*(\omega) \in \underline{d}_{x,y}f(x, y(\omega), \omega)$ for a.e. $\omega \in \Omega$. Thus,
$z_* \in \mathcal{E}(x, y)$, and the set $\mathcal{E}(x, y)$ is weakly sequentially compact, which completes
the proof.

Case $p < +\infty$. By the growth condition on the codifferential $D_{x,y}f(\cdot)$ (see
Assumption 1), there exist $C > 0$ and a.e. nonnegative functions $\beta \in L^1(\Omega, \mathfrak{A}, P)$
and $\gamma \in L^{p'}(\Omega, \mathfrak{A}, P)$ such that for a.e. $\omega \in \Omega$ and all $n \in \mathbb{N}$ one has

$$\max\{|a_n(\omega)|, |v_{xn}(\omega)|\} \le \beta(\omega) + C|y(\omega)|^p, \quad |v_{yn}(\omega)| \le \gamma(\omega) + C|y(\omega)|^{p-1}.$$

Observe that the right-hand side of the first inequality belongs to $L^1(\Omega, \mathfrak{A}, P)$, while
the right-hand side of the second one belongs to $L^{p'}(\Omega, \mathfrak{A}, P)$. Thus, the sequence
$\{v_{yn}\}$ is norm-bounded in $L^{p'}(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$, which due to the reflexivity of this
space (note that $1 < p' < +\infty$, since $1 < p < +\infty$) implies that there exists a
weakly convergent subsequence $\{v_{yn_k}\}$. In turn, the existence of weakly convergent
subsequences of the sequences $\{a_n\}$ and $\{v_{xn}\}$ follows from the weak compactness
criterion in $L^1$ (see [4, Thrm. 4.7.20]).

Thus, there exists a subsequence $\{z_{n_k}\}$ weakly converging to some $z_* \in Y$. Now,
applying Mazur's lemma and arguing precisely in the same way as in the case $p =
+\infty$, one can prove the weak compactness of the set $\mathcal{E}(x, y)$.


Denote by $\|\cdot\|_p$ the standard norm on $L^p(\Omega, \mathfrak{A}, P)$.


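Lemma 2 Let $1 < p \le +\infty$, Assumption 1 be valid, and the sets $\underline{d}\mathcal{I}(x, y)$ and
$\overline{d}\mathcal{I}(x, y)$ be defined as in Theorem 1. Then, for any $(x, y) \in X$ and $(\Delta x, \Delta y) \in X$,
one has

$$\begin{aligned}
\lim_{\alpha \to +0} \frac{1}{\alpha}\Big| \mathcal{I}(x + \alpha\Delta x, y + \alpha\Delta y) - \mathcal{I}(x, y)
&- \max_{(A, x^*) \in \underline{d}\mathcal{I}(x, y)} \big(A + \langle x^*, \alpha(\Delta x, \Delta y)\rangle\big) \\
&- \min_{(B, y^*) \in \overline{d}\mathcal{I}(x, y)} \big(B + \langle y^*, \alpha(\Delta x, \Delta y)\rangle\big)\Big| = 0.
\end{aligned}$$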


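Proof Fix any $(x, y) \in X$ and $(\Delta x, \Delta y) \in X$, and choose an arbitrary sequence
$\{\alpha_n\} \subset (0, +\infty)$ converging to zero. For a.e. $\omega \in \Omega$ and $n \in \mathbb{N}$, denote

$$\begin{aligned}
f_n(\omega) = \frac{1}{\alpha_n}\Big( & f(x + \alpha_n\Delta x, y(\omega) + \alpha_n\Delta y(\omega), \omega) - f(x, y(\omega), \omega) \\
& - \Phi_f(x, y(\omega), \omega; \alpha_n\Delta x, \alpha_n\Delta y(\omega)) - \Psi_f(x, y(\omega), \omega; \alpha_n\Delta x, \alpha_n\Delta y(\omega))\Big),
\end{aligned} \tag{7}$$

where the functions $\Phi_f$ and $\Psi_f$ are defined in (3). By the definition of codifferentiability,
the sequence $f_n$ converges to zero almost surely. Our aim is to prove that
each term in the definition of $f_n$ belongs to $L^1(\Omega, \mathfrak{A}, P)$ and there exists an a.e.
nonnegative function $\rho \in L^1(\Omega, \mathfrak{A}, P)$ such that $|f_n| \le \rho$ almost surely. Then,
by Lebesgue's dominated convergence theorem, $\mathbb{E}[|f_n|] \to 0$ as $n \to \infty$. Hence,
integrating each term in the definition of $f_n$ separately, one obtains that

$$\begin{aligned}
\lim_{n \to \infty} \frac{1}{\alpha_n}\Big| \mathcal{I}(x + \alpha_n\Delta x, y + \alpha_n\Delta y) - \mathcal{I}(x, y)
&- \int_\Omega \Phi_f(x, y(\omega), \omega; \alpha_n\Delta x, \alpha_n\Delta y(\omega))\, dP(\omega) \\
&- \int_\Omega \Psi_f(x, y(\omega), \omega; \alpha_n\Delta x, \alpha_n\Delta y(\omega))\, dP(\omega)\Big| = 0.
\end{aligned}$$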
2 
Let us check that

$$\int_\Omega \Phi_f(x, y(\omega), \omega; \alpha_n\Delta x, \alpha_n\Delta y(\omega))\, dP(\omega) = \max_{(A, x^*) \in \underline{d}\mathcal{I}(x, y)} \big(A + \langle x^*, \alpha_n(\Delta x, \Delta y)\rangle\big) \tag{8}$$

(a similar equality for the min terms involving the hyperdifferentials can be verified
in the same way). Then, one obtains the desired result.

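Indeed, by definition (see (3)) for any measurable selection $(a(\cdot), v_x(\cdot), v_y(\cdot))$ of
the set-valued mapping $\underline{d}_{x,y}f(x, y(\cdot), \cdot)$, one has

$$\Phi_f(x, y(\omega), \omega; \alpha_n\Delta x, \alpha_n\Delta y(\omega)) \ge a(\omega) + \langle v_x(\omega), \alpha_n\Delta x\rangle + \langle v_y(\omega), \alpha_n\Delta y(\omega)\rangle,$$

which implies that

$$\int_\Omega \Phi_f(x, y(\omega), \omega; \alpha_n\Delta x, \alpha_n\Delta y(\omega))\, dP(\omega) \ge \max_{(A, x^*) \in \underline{d}\mathcal{I}(x, y)} \big(A + \langle x^*, \alpha_n(\Delta x, \Delta y)\rangle\big)$$

(see (4)). On the other hand, for a.e. $\omega \in \Omega$, one has

$$\Phi_f(x, y(\omega), \omega; \alpha_n\Delta x, \alpha_n\Delta y(\omega)) \in \big\{a + \langle v_x, \alpha_n\Delta x\rangle + \langle v_y, \alpha_n\Delta y(\omega)\rangle \bigm| (a, v_x, v_y) \in \underline{d}_{x,y}f(x, y(\omega), \omega)\big\}.$$

Consequently, by the Filippov theorem (see, e.g., [1, Thrm. 8.2.10]), there exists
a measurable selection $(a_0(\cdot), v_{x0}(\cdot), v_{y0}(\cdot))$ of the multifunction $\underline{d}_{x,y}f(x, y(\cdot), \cdot)$
such that

$$\Phi_f(x, y(\omega), \omega; \alpha_n\Delta x, \alpha_n\Delta y(\omega)) = a_0(\omega) + \langle v_{x0}(\omega), \alpha_n\Delta x\rangle + \langle v_{y0}(\omega), \alpha_n\Delta y(\omega)\rangle$$

for a.e. $\omega \in \Omega$. Hence, for the corresponding pair $(A_0, x_0^*) = \mathcal{T}(a_0, v_{x0}, v_{y0})$ (see
the proof of Lemma 1), which by definition belongs to $\underline{d}\mathcal{I}(x, y)$, one has

$$\int_\Omega \Phi_f(x, y(\omega), \omega; \alpha_n\Delta x, \alpha_n\Delta y(\omega))\, dP(\omega) = A_0 + \langle x_0^*, \alpha_n(\Delta x, \Delta y)\rangle,$$

and therefore equality (8) holds true.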
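Equality (8) states that maximising pointwise in $\omega$ and then integrating gives the same value as maximising over all measurable selections. On a finite probability space this can be verified by brute force. The sketch below is only an illustration under several assumptions that are not taken from the chapter: $\Omega$ has four atoms, $d = m = 1$, the integrand is $f(x, y, \omega) = |x + y - \omega|$ with an assumed two-vertex hypodifferential, and the increment is taken with $\alpha_n = 1$. Only vertex selections are enumerated, since a linear functional attains its maximum over the set of selections at such a selection.

```python
import itertools

OMEGAS = [-1.0, 0.5, 2.0, 3.0]  # atoms of a hypothetical finite probability space
PROBS = [0.1, 0.4, 0.3, 0.2]    # their probabilities


def hypo_vertices(x, y, omega):
    """Assumed hypodifferential vertices (a, v_x, v_y) of |x + y - omega| in (x, y)."""
    s = x + y - omega
    return [(s - abs(s), 1.0, 1.0), (-s - abs(s), -1.0, -1.0)]


def phi(x, y, omega, dx, dy):
    """Pointwise max term Phi_f from (3)."""
    return max(a + vx * dx + vy * dy for a, vx, vy in hypo_vertices(x, y, omega))


def integral_of_phi(x, ys, dx, dys):
    """Left-hand side of (8): the expectation of the pointwise maximum."""
    return sum(p * phi(x, y, w, dx, dy) for p, y, w, dy in zip(PROBS, ys, OMEGAS, dys))


def max_over_selections(x, ys, dx, dys):
    """Right-hand side of (8): maximum of A + <x*, (dx, dy)> over vertex selections."""
    vertex_sets = [hypo_vertices(x, y, w) for y, w in zip(ys, OMEGAS)]

    def value(selection):
        return sum(p * (a + vx * dx + vy * dy)
                   for p, (a, vx, vy), dy in zip(PROBS, selection, dys))

    return max(value(sel) for sel in itertools.product(*vertex_sets))


def expectation(x, ys):
    """I(x, y) = E|x + y(omega) - omega| on the finite space."""
    return sum(p * abs(x + y - w) for p, y, w in zip(PROBS, ys, OMEGAS))


if __name__ == "__main__":
    x, ys = 0.3, [1.0, -2.0, 0.0, 4.0]
    dx, dys = -0.7, [0.5, 1.5, -3.0, 0.2]
    print(integral_of_phi(x, ys, dx, dys), max_over_selections(x, ys, dx, dys))
```

For this piecewise linear integrand both quantities also coincide with the exact increment of the expectation, which `expectation` allows one to confirm.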

Thus, it remains to show that Lebesgue's dominated convergence theorem is
applicable to the sequence $\{f_n\}$. Indeed, the first two terms in the definition of $f_n$
(see (7)) belong to $L^1(\Omega, \mathfrak{A}, P)$ by virtue of the first two parts of Assumption 1.
Let us check that these terms are dominated by a Lebesgue integrable function
independent of $n$.
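By the mean value theorem for codifferentiable functions [21, Prp. 2] for any
$n \in \mathbb{N}$ and for a.e. $\omega \in \Omega$, there exist $\alpha_n(\omega) \in (0, \alpha_n)$ and

$$\begin{aligned}
(0, v_{xn}(\omega), v_{yn}(\omega)) &\in \underline{d}_{x,y}f(x + \alpha_n(\omega)\Delta x, y(\omega) + \alpha_n(\omega)\Delta y(\omega), \omega), \\
(0, w_{xn}(\omega), w_{yn}(\omega)) &\in \overline{d}_{x,y}f(x + \alpha_n(\omega)\Delta x, y(\omega) + \alpha_n(\omega)\Delta y(\omega), \omega)
\end{aligned}$$

such that

$$\begin{aligned}
\frac{1}{\alpha_n}\big(f(x + \alpha_n\Delta x, y(\omega) + \alpha_n\Delta y(\omega), \omega) &- f(x, y(\omega), \omega)\big) \\
&= \langle v_{xn}(\omega) + w_{xn}(\omega), \Delta x\rangle + \langle v_{yn}(\omega) + w_{yn}(\omega), \Delta y(\omega)\rangle.
\end{aligned} \tag{9}$$

Put $\alpha_* = \max_{n \in \mathbb{N}} \alpha_n$. By the growth condition on the codifferential $D_{x,y}f$ (see
Assumption 1), there exist $C_N > 0$ and nonnegative functions $\beta_N \in L^1(\Omega, \mathfrak{A}, P)$
and $\gamma_N \in L^{p'}(\Omega, \mathfrak{A}, P)$ (here $N = |x| + \alpha_*|\Delta x|$) such that

$$\begin{aligned}
\max\{|v_{xn}(\omega)|, |w_{xn}(\omega)|\} &\le \beta_N(\omega) + C_N|y(\omega) + \alpha_n(\omega)\Delta y(\omega)|^p \\
&\le \beta_N(\omega) + C_N 2^p\big(|y(\omega)|^p + \alpha_*^p|\Delta y(\omega)|^p\big), \\
\max\{|v_{yn}(\omega)|, |w_{yn}(\omega)|\} &\le \gamma_N(\omega) + C_N|y(\omega) + \alpha_n(\omega)\Delta y(\omega)|^{p-1}
\end{aligned}$$

for a.e. $\omega \in \Omega$ and all $n \in \mathbb{N}$ in the case $1 < p < +\infty$, and there exists $\beta_N \in
L^1(\Omega, \mathfrak{A}, P)$ (here $N = \max\{|x| + \alpha_*|\Delta x|, \|y\|_\infty + \alpha_*\|\Delta y\|_\infty\}$) such that

$$\max\{|v_{xn}(\omega)|, |w_{xn}(\omega)|, |v_{yn}(\omega)|, |w_{yn}(\omega)|\} \le \beta_N(\omega)$$

for a.e. $\omega \in \Omega$ and all $n \in \mathbb{N}$ in the case $p = +\infty$. Hence, with the use of (9), one
obtains that in the case $p = +\infty$, the inequality

$$\frac{1}{\alpha_n}\big|f(x + \alpha_n\Delta x, y(\omega) + \alpha_n\Delta y(\omega), \omega) - f(x, y(\omega), \omega)\big| \le 2\beta_N(\omega)|\Delta x| + 2\beta_N(\omega)\|\Delta y\|_\infty$$

holds true for a.e. $\omega \in \Omega$ and all $n \in \mathbb{N}$, which implies that the first two terms
in the definition of $f_n$ (see (7)) are dominated by a Lebesgue integrable function
independent of $n$. In the case $p < +\infty$, one has

$$\begin{aligned}
\frac{1}{\alpha_n}\big|f(x + \alpha_n\Delta x, y(\omega) + \alpha_n\Delta y(\omega), \omega) &- f(x, y(\omega), \omega)\big| \\
&\le 2\Big(\beta_N(\omega) + C_N 2^p\big(|y(\omega)|^p + \alpha_*^p|\Delta y(\omega)|^p\big)\Big)|\Delta x| \\
&\quad + 2\Big(\gamma_N(\omega) + C_N 2^{p-1}\big(|y(\omega)|^{p-1} + \alpha_*^{p-1}|\Delta y(\omega)|^{p-1}\big)\Big)|\Delta y(\omega)|.
\end{aligned}$$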
The right-hand side of this inequality does not depend on $n$ and is Lebesgue
integrable, as one can easily verify with the use of Hölder's inequality and the
equality $p'(p - 1) = p$. Thus, in the case $p < +\infty$, the first two terms in the
definition of $f_n$ are dominated by a Lebesgue integrable function independent of $n$
as well.
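For instance, the integrability of the mixed term $|y|^{p-1}|\Delta y|$ can be checked as follows. This is a worked step under the standing assumption $1 < p < +\infty$, with $\|\cdot\|_p$ understood as the norm of $L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$; it uses Hölder's inequality with the exponents $p'$ and $p$, the equality $p'(p-1) = p$, and $p/p' = p - 1$:

```latex
\int_\Omega |y(\omega)|^{p-1}\,|\Delta y(\omega)|\, dP(\omega)
  \le \Big( \int_\Omega |y(\omega)|^{(p-1)p'}\, dP(\omega) \Big)^{1/p'} \|\Delta y\|_p
  = \Big( \int_\Omega |y(\omega)|^{p}\, dP(\omega) \Big)^{1/p'} \|\Delta y\|_p
  = \|y\|_p^{\,p-1}\, \|\Delta y\|_p < +\infty .
```

The term $\gamma_N|\Delta y|$ is handled in the same way, because $\gamma_N \in L^{p'}(\Omega, \mathfrak{A}, P)$ and $\Delta y \in L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$.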

Let us finally check that the third term in the definition of $f_n$, denoted by

$$\theta_n(\omega) = \frac{1}{\alpha_n} \max_{(a, v_x, v_y) \in \underline{d}_{x,y}f(x, y(\omega), \omega)} \big(a + \langle v_x, \alpha_n\Delta x\rangle + \langle v_y, \alpha_n\Delta y(\omega)\rangle\big)$$

(see (7)), is measurable and dominated by a Lebesgue integrable function independent
of $n$. The fact that the last term (the min term) in the definition of $f_n$ is
measurable and dominated by a Lebesgue integrable function independent of $n$ is
proved in exactly the same way.

As was shown in the proof of Lemma 1, the set-valued mapping $\underline{d}_{x,y}f(x, y(\cdot), \cdot)$
is measurable. Consequently, the function $\theta_n$ is measurable by Aubin and
Frankowska [1, Thrm. 8.2.11].

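For any $\omega \in \Omega$, introduce the function

$$g_\omega(t) = \max_{(a, v_x, v_y) \in \underline{d}_{x,y} f(x, y(\omega), \omega)} \big( a + \langle v_x, t \Delta x \rangle + \langle v_y, t \Delta y(\omega) \rangle \big).$$

Observe that by the definition of codifferential $g_\omega(0) = 0$ (see Def. 1), and for any $t, \Delta t \in \mathbb{R}$ and $\alpha > 0$, one has

$$\frac{1}{\alpha} \Big| g_\omega(t + \alpha \Delta t) - g_\omega(t) - \max_{(a_g, v_g) \in \underline{d} g_\omega(t)} \big( a_g + v_g (\alpha \Delta t) \big) \Big| = 0,$$

where

$$
\begin{aligned}
\underline{d} g_\omega(t) = \Big\{ (a_g, v_g) \in \mathbb{R} \times \mathbb{R} \Bigm|\; & a_g = a + \langle v_x, t \Delta x \rangle + \langle v_y, t \Delta y(\omega) \rangle - g_\omega(t), \\
& v_g = \langle v_x, \Delta x \rangle + \langle v_y, \Delta y(\omega) \rangle, \quad (a, v_x, v_y) \in \underline{d}_{x,y} f(x, y(\omega), \omega) \Big\}.
\end{aligned}
$$

The set $\underline{d} g_\omega(t)$ is obviously convex and compact. Moreover, note that the equality $\max\{ a_g \mid (a_g, v_g) \in \underline{d} g_\omega(t) \} = g_\omega(t) - g_\omega(t) = 0$ holds true. Thus, the function $g_\omega$ is codifferentiable at every point $t \in \mathbb{R}$, and the pair $[\underline{d} g_\omega(t), \{0\}]$ is a codifferential of $g_\omega$ at the point $t$.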

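Applying the mean value theorem for codifferentiable functions [21, Prp. 2], one obtains that for any $n \in \mathbb{N}$ and for a.e. $\omega \in \Omega$, there exist $\alpha_n(\omega) \in (0, \alpha_n)$ and $(0, v_{gn}(\omega)) \in \underline{d} g_\omega(\alpha_n(\omega))$ such that

$$\theta_n(\omega) = \frac{1}{\alpha_n} \big( g_\omega(\alpha_n) - g_\omega(0) \big) = v_{gn}(\omega)$$

or, equivalently, there exists $(a_n(\omega), v_{xn}(\omega), v_{yn}(\omega)) \in \underline{d}_{x,y} f(x, y(\omega), \omega)$ such that

$$\theta_n(\omega) = \langle v_{xn}(\omega), \Delta x \rangle + \langle v_{yn}(\omega), \Delta y(\omega) \rangle \quad \forall n \in \mathbb{N}.$$

Hence, by the growth condition on the codifferential $D_{x,y} f$ (see Assumption 1), there exist $C_N > 0$ and a.e. nonnegative functions $\beta_N \in L^1(\Omega, \mathfrak{A}, P)$ and $\gamma_N \in L^{p'}(\Omega, \mathfrak{A}, P)$ (here $N = |x|$) satisfying the inequality

$$|\theta_n(\omega)| \le \big( \beta_N(\omega) + C_N |y(\omega)|^p \big) |\Delta x| + \big( \gamma_N(\omega) + C_N |y(\omega)|^{p-1} \big) |\Delta y(\omega)|$$

for a.e. $\omega \in \Omega$ in the case $p < +\infty$ and the inequality

$$|\theta_n(\omega)| \le \beta_N(\omega) |\Delta x| + \beta_N(\omega) \|\Delta y\|_\infty$$

for a.e. $\omega \in \Omega$ in the case $p = +\infty$. The right-hand sides of these inequalities are Lebesgue integrable and do not depend on $n$. Thus, the sequence $\{\theta_n\}$ is dominated by a Lebesgue integrable function, which completes the proof.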


With the use of Theorem 1, one can easily obtain sufficient conditions for the quasidifferentiability of the functional $\mathcal{I}$. Recall that $X = \mathbb{R}^d \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$.


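Corollary 1 Let $1 < p < +\infty$ and Assumption 1 be valid. Then, the functional $\mathcal{I}$ is quasidifferentiable on $\mathbb{R}^d \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$, and for any $(x, y)$ from this space, the pair $\mathscr{D}\mathcal{I}(x, y) = [\underline{\partial} \mathcal{I}(x, y), \overline{\partial} \mathcal{I}(x, y)]$, defined as

$$
\begin{aligned}
\underline{\partial} \mathcal{I}(x, y) = \Big\{ x^* \in X^* \Bigm|\; & \langle x^*, (h_x, h_y) \rangle = \langle \mathbb{E}[v_x], h_x \rangle + \int_\Omega \langle v_y(\omega), h_y(\omega) \rangle \, dP(\omega) \quad \forall (h_x, h_y) \in X, \\
& (0, v_x(\cdot), v_y(\cdot)) \text{ is a measurable selection of } \underline{d}_{x,y} f(x, y(\cdot), \cdot) \Big\}
\end{aligned}
$$

and

$$
\begin{aligned}
\overline{\partial} \mathcal{I}(x, y) = \Big\{ y^* \in X^* \Bigm|\; & \langle y^*, (h_x, h_y) \rangle = \langle \mathbb{E}[w_x], h_x \rangle + \int_\Omega \langle w_y(\omega), h_y(\omega) \rangle \, dP(\omega) \quad \forall (h_x, h_y) \in X, \\
& (0, w_x(\cdot), w_y(\cdot)) \text{ is a measurable selection of } \overline{d}_{x,y} f(x, y(\cdot), \cdot) \Big\},
\end{aligned}
$$

is a quasidifferential of $\mathcal{I}$ at $(x, y)$. Moreover, the following equality holds true:

$$\mathcal{I}'(x, y; h_x, h_y) = \int_\Omega [f(\cdot, \cdot, \omega)]'(x, y(\omega); h_x, h_y(\omega)) \, dP(\omega) \quad \forall (h_x, h_y) \in X. \tag{10}$$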
Proof Applying Theorem 1 and the fact that any codifferentiable function $g$ with codifferential $Dg(x)$ is quasidifferentiable and the pair

$$\underline{\partial} g(x) = \{ x^* \in X^* \mid (0, x^*) \in \underline{d} g(x) \}, \quad \overline{\partial} g(x) = \{ y^* \in X^* \mid (0, y^*) \in \overline{d} g(x) \}$$

is a quasidifferential of $g$ at $x$ (see, e.g., [14, 21]), one obtains the required results on the quasidifferentiability of the functional $\mathcal{I}$.

To prove equality (10), recall that the set-valued mappings $\underline{d}_{x,y} f(x, y(\cdot), \cdot)$ and $\overline{d}_{x,y} f(x, y(\cdot), \cdot)$ are measurable, as was shown in the proof of Lemma 1. Hence, with the use of [1, Thrm. 8.2.4], one obtains that the set-valued mappings $\underline{\partial}_{x,y} f(x, y(\cdot), \cdot)$ and $\overline{\partial}_{x,y} f(x, y(\cdot), \cdot)$, defined according to equality (2), are measurable as well. Consequently, applying the definition of quasidifferentiability and arguing in the same way as in the proof of Lemma 2 (or utilizing the interchangeability principle; see, e.g., [37, Thrm. 14.60]), one gets that

$$
\begin{aligned}
\mathcal{I}'(x, y; h_x, h_y) = \int_\Omega \Big( & \max_{(v_x, v_y) \in \underline{\partial}_{x,y} f(x, y(\omega), \omega)} \big( \langle v_x, h_x \rangle + \langle v_y, h_y(\omega) \rangle \big) \\
& + \min_{(w_x, w_y) \in \overline{\partial}_{x,y} f(x, y(\omega), \omega)} \big( \langle w_x, h_x \rangle + \langle w_y, h_y(\omega) \rangle \big) \Big) \, dP(\omega)
\end{aligned}
$$

for all $(h_x, h_y) \in X$, which by the definition of quasidifferential of the function $f$ implies that equality (10) holds true.


Remark 2 In the particular case when $f$ does not depend on $y$, i.e., $f = f(x, \omega)$, the previous corollary contains sufficient conditions for the quasidifferentiability of the function $F(x) = \mathbb{E}[f(x, \cdot)]$. Quasidifferentiability of this function was studied in the recent paper [30] under different assumptions on the function $f$. Namely, instead of imposing any growth conditions, in [30], it was assumed that all integrals are correctly defined and the function $f$ is locally Lipschitz continuous in $x$ uniformly in $\omega$.


Let us finally show that under the assumptions of Theorem 1, the functional $\mathcal{I}(x, y)$ is not only codifferentiable but also Lipschitz continuous on bounded sets.

Corollary 2 Let $1 < p < +\infty$ and Assumption 1 be valid. Then, $\mathcal{I}$ is Lipschitz continuous on any bounded subset of the space $X = \mathbb{R}^d \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$.

Proof With the use of the growth condition on the codifferential mapping $D_{x,y} f(\cdot)$ from Assumption 1, one can readily verify that both multifunctions $\underline{d} \mathcal{I}(\cdot)$ and $\overline{d} \mathcal{I}(\cdot)$ are bounded on bounded subsets of the space $X$. Therefore, by Dolgopolik [21, Corollary 2], the functional $\mathcal{I}$ is Lipschitz continuous on any bounded subset of this space.
4 Nonsmooth Two-Stage Stochastic Programming


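Let, as above, $(\Omega, \mathfrak{A}, P)$ be a probability space. In this section, we study a general two-stage stochastic programming problem of the form

$$\min_{x \in A} \; \mathbb{E}[F(x, \cdot)], \tag{11}$$

where $F(x, \omega)$ is the optimal value of the second-stage problem

$$\min_{y \in G(x, \omega)} \; f(x, y, \omega). \tag{12}$$

Here, $A \subset \mathbb{R}^d$ is a closed set, $f\colon \mathbb{R}^d \times \mathbb{R}^m \times \Omega \to \mathbb{R}$ is a Carathéodory function, and $G\colon \mathbb{R}^d \times \Omega \rightrightarrows \mathbb{R}^m$ is a multifunction. We assume that $G$ is measurable, and for every $\omega \in \Omega$ the multifunction $G(\cdot, \omega)$ is closed.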

Choose any $1 < p < +\infty$, and denote $X = \mathbb{R}^d \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$. By the interchangeability principle for two-stage stochastic programming (see, e.g., [40, Thrm. 2.20]), problem (11), (12) is equivalent to the following variational problem with pointwise constraints:

$$\min_{(x, y) \in X} \; \mathbb{E}[f(x, y(\cdot), \cdot)] \quad \text{subject to} \quad x \in A, \quad y(\omega) \in G(x, \omega) \ \text{for a.e. } \omega \in \Omega, \tag{$\mathcal{P}$}$$

in the sense that the optimal values of these problems coincide, and if this common optimal value is finite, then for any globally optimal solution $(x_*, y_*(\cdot))$ of the problem $(\mathcal{P})$, the point $x_*$ is a globally optimal solution of problem (11), and for a.e. $\omega \in \Omega$ the point $y_*(\omega)$ is a globally optimal solution of the second-stage problem (12). Conversely, if $x_*$ is a globally optimal solution of problem (11) and for a.e. $\omega \in \Omega$ the point $y_*(\omega)$ is a globally optimal solution of problem (12) with $x = x_*$ such that $y_* \in L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$, then $(x_*, y_*)$ is a globally optimal solution of the problem $(\mathcal{P})$.
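
On a finite sample space the equivalence of the two formulations can be checked by brute force. The sketch below is purely illustrative: the scenarios, the probabilities, the cost `f`, and the grids standing in for the set $A$ and the (here constant) feasible set $G$ are hypothetical choices, not data from the chapter.

```python
from itertools import product

# Hypothetical toy data: three equally likely scenarios.
SCENARIOS = [0.0, 1.0, 2.0]
PROBS = [1 / 3, 1 / 3, 1 / 3]
GRID = [k / 10 for k in range(11)]  # grid on [0, 1], used for both x and y


def f(x, y, w):
    """Illustrative second-stage cost (an assumption, not from the chapter)."""
    return (y - w * x) ** 2 + (x - 0.5) ** 2


def two_stage_value():
    """min over x of E[F(x, .)], where F(x, w) = min over y of f(x, y, w)."""
    return min(
        sum(p * min(f(x, y, w) for y in GRID) for w, p in zip(SCENARIOS, PROBS))
        for x in GRID
    )


def variational_value():
    """min over x and over all scenario-wise decisions y(.) of E[f(x, y(.), .)]."""
    return min(
        sum(p * f(x, y, w) for y, w, p in zip(ys, SCENARIOS, PROBS))
        for x in GRID
        for ys in product(GRID, repeat=len(SCENARIOS))
    )
```

Both functions return the same optimal value: minimizing the expectation over all decision rules $y(\cdot)$ amounts to minimizing scenario by scenario. In the general (non-finite) setting this step is exactly what requires the measurable selection arguments behind the interchangeability principle.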

Since problem (11), (12) and the problem $(\mathcal{P})$ are equivalent, below we consider only the problem $(\mathcal{P})$. Our aim is to present several results on exact penalty functions for the problem $(\mathcal{P})$, which not only allow one to obtain optimality conditions for the original two-stage stochastic programming problem but also can be used for the design and analysis of exact penalty methods for solving problem (11), (12).
4.1 Exact Penalty Functions


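Fix any $p \in [1, +\infty]$, and denote by

$$\mathcal{I}(x, y) = \int_\Omega f(x, y(\omega), \omega) \, dP(\omega)$$

the objective function of the problem $(\mathcal{P})$. Below we suppose that the functional $\mathcal{I}$ is correctly defined on the space $X := \mathbb{R}^d \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$ and does not take the value $-\infty$. In particular, it is sufficient to suppose that for any $x \in \mathbb{R}^d$, there exist $C > 0$ and an a.e. nonnegative function $\beta \in L^1(\Omega, \mathfrak{A}, P)$ such that $|f(x, y, \omega)| \le \beta(\omega) + C |y|^p$ for a.e. $\omega \in \Omega$ and all $y \in \mathbb{R}^m$ in the case $p < +\infty$, and for any $x \in \mathbb{R}^d$ and $N > 0$, there exists an a.e. nonnegative function $\beta_N \in L^1(\Omega, \mathfrak{A}, P)$ such that $|f(x, y, \omega)| \le \beta_N(\omega)$ for a.e. $\omega \in \Omega$ and all $y \in \mathbb{R}^m$ with $|y| \le N$ in the case $p = +\infty$.
Introduce the set

$$M = \{ (x, y) \in X \mid y(\omega) \in G(x, \omega) \ \text{for a.e. } \omega \in \Omega \}.$$

Then, the problem $(\mathcal{P})$ can be rewritten as follows:

$$\min_{(x, y) \in X} \; \mathcal{I}(x, y) \quad \text{subject to} \quad (x, y) \in M \cap \big( A \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m) \big).$$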


Let $\varphi\colon X \to [0, +\infty]$ be any function such that $\varphi(x, y) = 0$ iff $(x, y) \in M$, and let $\Phi_c(x, y) = \mathcal{I}(x, y) + c \varphi(x, y)$. The function $\Phi_c$ is called a penalty function for the problem $(\mathcal{P})$ with $c \ge 0$ being the penalty parameter, while the function $\varphi$ is called a penalty term for the constraint $(x, y) \in M$. Our aim is to obtain sufficient conditions for the exactness of the penalty function $\Phi_c$.

Recall that the penalty function $\Phi_c$ is called globally exact, if there exists $c_* \ge 0$ such that for any $c \ge c_*$, the set of globally optimal solutions of the penalized problem

$$\min_{(x, y) \in X} \; \Phi_c(x, y) \quad \text{subject to} \quad x \in A \tag{13}$$

coincides with the set of globally optimal solutions of the problem $(\mathcal{P})$. The greatest lower bound of all such $c_*$ is called the least exact penalty parameter of the penalty function $\Phi_c$. One can verify that the penalty function $\Phi_c$ is globally exact iff there exists $c_* \ge 0$ such that for any $c \ge c_*$ the problem $(\mathcal{P})$ and problem (13) have the same optimal value, and the greatest lower bound of all such $c_*$ coincides with the least exact penalty parameter. See [11, 16, 20, 23, 38, 48] for more details on exact penalty functions.
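
A one-dimensional toy problem may help to fix the idea of exactness. The sketch below is a hypothetical illustration (deterministic, with no expectation involved, and unrelated to the chapter's data): minimize `objective(y) = y` subject to `y >= 1`, with the distance to the feasible set as the penalty term, on a bounded grid.

```python
def objective(y):
    """Toy objective; its constrained minimizer over y >= 1 is y = 1."""
    return y


def penalty_term(y):
    """Distance from y to the feasible set [1, +inf)."""
    return max(0.0, 1.0 - y)


def penalized_minimizers(c, grid):
    """Grid points minimizing objective + c * penalty_term."""
    values = [objective(y) + c * penalty_term(y) for y in grid]
    best = min(values)
    return [y for y, v in zip(grid, values) if abs(v - best) < 1e-12]


# Bounded grid on [-5, 5]; on the whole real line the penalized
# problem with c < 1 would be unbounded below.
GRID = [k / 10 for k in range(-50, 51)]
```

With `c = 2.0` the only penalized minimizer on the grid is the constrained solution `1.0`, whereas with `c = 0.5` the minimizer is the infeasible grid endpoint. In this toy problem the penalized and the original solutions coincide for every $c > 1$ and fail to coincide for $c \le 1$, so the least exact penalty parameter is $1$.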

Let us obtain sufficient conditions for the global exactness of the penalty function $\Phi_c$ with the penalty term $\varphi$ defined in several different ways. To this end, we will utilize general sufficient conditions for the exactness of penalty functions in metric and normed spaces from [20, 23] and the following auxiliary lemma, which is a slight generalization of [20, Prp. 3.13].


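Lemma 3 Let $Y$ be a normed space, $\mathcal{F} \subset Y$ be a nonempty set, and a function $F\colon Y \to \mathbb{R} \cup \{+\infty\}$ be such that for any bounded set $C \subset Y$ there exists a continuous from the right function $\omega_C\colon [0, +\infty) \to [0, +\infty)$ for which

$$|F(y_1) - F(y_2)| \le \omega_C(\|y_1 - y_2\|) \quad \forall y_1, y_2 \in C. \tag{14}$$

Then, for any $R > 0$, there exists a bounded set $C \subset Y$ such that

$$F(y) \ge \inf_{z \in \mathcal{F}} F(z) - \omega_C(\operatorname{dist}(y, \mathcal{F})) \quad \forall y \in B(0, R) = \{ z \in Y \mid \|z\| \le R \}. \tag{15}$$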


Proof Denote $F_* = \inf_{z \in \mathcal{F}} F(z)$, and fix any $R > 0$ and $z \in \mathcal{F}$. By our assumption, there exists a continuous from the right function $\omega_C$ such that inequality (14) holds true for $C = B(0, R + \|z\|)$.

Choose any $y \in B(0, R)$. If $y \in \mathcal{F}$, then inequality (15) trivially holds true. Suppose now that $y \in B(0, R) \setminus \mathcal{F}$. Clearly, there exists a sequence $\{y_n\} \subset \mathcal{F}$ such that $\|y - y_n\| \to \operatorname{dist}(y, \mathcal{F})$ as $n \to \infty$, and the inequalities $\|y - y_n\| \le \|y - z\| \le R + \|z\|$ and $\|y - y_n\| \ge \|y - y_{n+1}\|$ are satisfied for all $n \in \mathbb{N}$. By definition, $\{y_n\} \subset C$, $y \in C$, and $F(y_n) \ge F_*$ for all $n \in \mathbb{N}$. Therefore, by applying inequality (14), one obtains that

$$F_* - F(y) = F_* - F(y_n) + F(y_n) - F(y) \le F(y_n) - F(y) \le \omega_C(\|y - y_n\|)$$

for any $n \in \mathbb{N}$. Hence, passing to the limit as $n \to \infty$ with the use of the fact that the function $\omega_C$ is continuous from the right and the sequence $\{\|y - y_n\|\}$ is nonincreasing, one gets that inequality (15) holds true.


Remark 3 Note that if $F$ is Lipschitz continuous on bounded sets, then inequality (14) holds true with $\omega_C(t) = L_C t$, where $L_C$ is a Lipschitz constant of $F$ on $C$. In this case, the statement of the lemma can be reformulated as follows: for any $R > 0$, there exists $L > 0$ such that $F(y) \ge F_* - L \operatorname{dist}(y, \mathcal{F})$ for all $y \in B(0, R)$. Thus, Lemma 3 provides a lower estimate of the decay of the function $F$ relative to a given set $\mathcal{F}$.


We start our analysis of the exactness of the penalty function $\Phi_c$ with the simplest case when the penalty term $\varphi$ is defined via the distance function to the multifunction $G$. Denote by $\mathcal{I}_*$ the optimal value of the problem $(\mathcal{P})$.


Theorem 2 Let there exist a globally optimal solution of the problem $(\mathcal{P})$, the set-valued mapping $G$ have closed images, and

$$\varphi(x, y) = \Big( \mathbb{E}\big[ \operatorname{dist}(y(\cdot), G(x, \cdot))^p \big] \Big)^{1/p} \quad \forall (x, y) \in X$$

in the case $p < +\infty$, and $\varphi(x, y) = \operatorname{ess\,sup}_{\omega \in \Omega} \operatorname{dist}(y(\omega), G(x, \omega))$ for all $(x, y) \in X$ in the case $p = +\infty$. Suppose also that the functional $\mathcal{I}$ is Lipschitz continuous on bounded sets, and there exists $c \ge 0$ such that the set

$$\{ (x, y) \in X \mid x \in A, \ \Phi_c(x, y) < \mathcal{I}_* \}$$

is bounded. Then, the penalty function $\Phi_c$ is globally exact.


Proof Observe that the function $\varphi$ is correctly defined for all $(x, y) \in X$, since the multifunction $G$ is measurable. Moreover, $\varphi$ is nonnegative, and $\varphi(x, y) = 0$ iff $(x, y) \in M$. Denote by $\mathcal{F}$ the feasible set of the problem $(\mathcal{P})$. Let us show that

$$\varphi(x, y) \ge \operatorname{dist}((x, y), \mathcal{F}) \quad \forall x \in A, \ y \in L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m). \tag{16}$$

Indeed, fix any $(x, y) \in X$ such that $x \in A$. If $\varphi(x, y) = +\infty$, then inequality (16) obviously holds true. Suppose now that $\varphi(x, y) < +\infty$. Then, in particular, one has $G(x, \omega) \ne \emptyset$ for a.e. $\omega \in \Omega$.

By our assumptions, the multifunction $G$ is measurable and has closed images. Therefore, by Aubin and Frankowska [1, Crlr. 8.2.13], there exists a measurable selection $z$ of the set-valued mapping $G(x, \cdot)$ such that

$$|y(\omega) - z(\omega)| = \operatorname{dist}(y(\omega), G(x, \omega)) \quad \text{for a.e. } \omega \in \Omega.$$

Let us check that $z \in L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$. Then, $(x, z) \in \mathcal{F}$ and

$$\varphi(x, y) = \|y - z\|_p = \|(x, y) - (x, z)\|_X \ge \operatorname{dist}((x, y), \mathcal{F}),$$

that is, inequality (16) holds true.
To verify that $z$ belongs to the space $L^p$, observe that

$$|z(\omega)| \le |y(\omega)| + |z(\omega) - y(\omega)| = |y(\omega)| + \operatorname{dist}(y(\omega), G(x, \omega))$$

for a.e. $\omega \in \Omega$. The right-hand side of this inequality belongs to $L^p(\Omega, \mathfrak{A}, P)$ due to the fact that $\varphi(x, y) < +\infty$. Therefore, the function $z$ belongs to $L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$ as well.

Thus, inequality (16) holds true. Since the functional $\mathcal{I}$ is Lipschitz continuous on bounded sets, by Lemma 3, for any $R > 0$, there exists $L > 0$ such that

$$\mathcal{I}(x, y) \ge \mathcal{I}_* - L \operatorname{dist}((x, y), \mathcal{F}) \quad \forall (x, y) \in B(0, R).$$

Hence, by Dolgopolik [20, Prp. 3.16 and Remark 15, part (ii)], the penalty function $\Phi_c$ is globally exact.
Remark 4 Note that by Corollary 2, the functional $\mathcal{I}$ is Lipschitz continuous on bounded sets in the case $p > 1$, provided the integrand $f$ satisfies Assumption 1. In turn, as one can readily verify, the set $\{ (x, y) \in X \mid x \in A, \ \Phi_c(x, y) < \mathcal{I}_* \}$ is bounded for some $c \ge 0$, if $1 < p < +\infty$ and one of the following conditions is satisfied:

1. The set $A$ is bounded, and the multifunction $G$ is bounded on $A \times \Omega$.

2. The set $A$ is bounded, and there exist $C > 0$ and $\beta \in L^1(\Omega, \mathfrak{A}, P)$ such that $f(x, y, \omega) \ge C |y|^p + \beta(\omega)$ for all $(x, y) \in A \times \mathbb{R}^m$ and a.e. $\omega \in \Omega$.

3. The multifunction $G$ is bounded on $A \times \Omega$, and there exist $\beta \in L^1(\Omega, \mathfrak{A}, P)$ and a function $\rho\colon [0, +\infty) \to [0, +\infty)$ such that $\rho(t) \to +\infty$ as $t \to +\infty$, and $f(x, y, \omega) \ge \rho(|x|) + \beta(\omega)$ for all $(x, y) \in \mathbb{R}^{d+m}$ and a.e. $\omega \in \Omega$.

4. There exist $C > 0$, $\beta \in L^1(\Omega, \mathfrak{A}, P)$, and a function $\rho\colon [0, +\infty) \to [0, +\infty)$ such that $\rho(t) \to +\infty$ as $t \to +\infty$, and $f(x, y, \omega) \ge \rho(|x|) + C |y|^p + \beta(\omega)$ for all $(x, y) \in \mathbb{R}^{d+m}$ and a.e. $\omega \in \Omega$.

5. $(\Omega, \mathfrak{A}, P)$ is a finite probability space, and $\min_{\omega \in \Omega} f(x, y, \omega) \to +\infty$ as $|x| + |y| \to +\infty$.

In the case $p = +\infty$, the set $\{ (x, y) \in X \mid x \in A, \ \Phi_c(x, y) < \mathcal{I}_* \}$ is bounded, provided the first, the third, or the last of the assumptions above is satisfied.


In most particular cases, the feasible set $G(x, \omega)$ of the second-stage problem (12) is not defined explicitly, but rather via some constraints. As a result, one usually does not know an explicit expression for the penalty term $\varphi$ from Theorem 2, which makes this theorem inapplicable to real-world problems, at least in a direct way. In some cases, Theorem 2 can still be applied indirectly to reduce an analysis of the exactness of a penalty function for the problem $(\mathcal{P})$ to an analysis of constraints of the second-stage problem. Let us explain this statement with the use of a simple example.


Example 1 Suppose that the set-valued map $G$ is defined in the following way:

$$G(x, \omega) = \{ y \in \mathbb{R}^m \mid 0 \in Q(x, y, \omega) \},$$


where Q: R? x R” x Q — R’ is a multifunction with closed images. In other 
words, the second-stage problem (12) has the form: 


$$\min_{y} \; f(x, y, \omega) \quad \text{subject to} \quad 0 \in Q(x, y, \omega).$$

In this case, it is natural to define

$$\varphi(x, y) = \Big( \mathbb{E}\big[ \operatorname{dist}(0, Q(x, y(\cdot), \cdot))^p \big] \Big)^{1/p}, \quad 1 \le p < +\infty.$$

Then, $\varphi(x, y) = 0$ iff $(x, y) \in M$. Suppose that there exists $K > 0$ such that

$$K \operatorname{dist}(0, Q(x, y, \omega)) \ge \operatorname{dist}(y, G(x, \omega)) \quad \forall x \in A, \ y \in \mathbb{R}^m, \ \omega \in \Omega,$$

that is, the function $g(y) = \operatorname{dist}(0, Q(x, y, \omega))$ admits a global error bound uniform for all $x \in A$ and $\omega \in \Omega$. Then,

$$\Phi_{Kc}(x, y) = \mathcal{I}(x, y) + K c \varphi(x, y) \ge \mathcal{I}(x, y) + c \psi(x, y)$$

for all $x \in A$ and $y \in L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$, where

$$\psi(x, y) = \Big( \mathbb{E}\big[ \operatorname{dist}(y(\cdot), G(x, \cdot))^p \big] \Big)^{1/p}.$$

Therefore, as one can readily verify (cf. [20, Prp. 2.2]), under the assumptions of Theorem 2, the penalty function $\Phi_c$ is globally exact, and its least exact penalty parameter is at most $K$ times greater than the least exact penalty parameter of the penalty function from Theorem 2.


Let us also point out two simple cases when Theorem 2 can be applied directly, that is, the cases when one can write a simple explicit expression for the penalty term $\varphi$ from this theorem. Note that Theorem 2 can be applied directly whenever the distance from a given point $y$ to the set $G(x, \omega)$ is easy to compute, e.g., when the set $G(x, \omega)$ is defined by linear or, more generally, convex quadratic constraints.


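Example 2 Let $I := \{1, \ldots, m\}$. Suppose that the set $G(x, \omega)$ is defined by bound (box) constraints, that is,

$$G(x, \omega) = \big\{ y = (y_1, \ldots, y_m)^T \in \mathbb{R}^m \bigm| a_i(x, \omega) \le y_i \le b_i(x, \omega), \ i \in I \big\}$$

for some given functions $a_i$ and $b_i$. Let the space $\mathbb{R}^m$ be equipped with the $\ell_\infty$ norm. Then the penalty term $\varphi$ from Theorem 2 has the form

$$\varphi(x, y) = \Big( \int_\Omega \max_{i \in I} \big\{ 0, \ y_i(\omega) - b_i(x, \omega), \ a_i(x, \omega) - y_i(\omega) \big\}^p \, dP(\omega) \Big)^{1/p}$$

in the case $1 < p < +\infty$.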
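
On a finite sample space the integral becomes a weighted sum, and the box penalty can be evaluated directly. The following minimal sketch assumes, for simplicity, that the bounds do not depend on $x$ or $\omega$ and that distances are measured in the $\ell_\infty$ norm; the function names are illustrative only.

```python
def dist_to_box_inf(y, lower, upper):
    """l_inf distance from the point y to the box {z : lower_i <= z_i <= upper_i}."""
    return max(
        max(0.0, yi - bi, ai - yi) for yi, ai, bi in zip(y, lower, upper)
    )


def box_penalty(y_samples, probs, lower, upper, p):
    """(sum over scenarios of P(w) * dist(y(w), box)^p)^(1/p) for finite p >= 1."""
    total = sum(
        q * dist_to_box_inf(y, lower, upper) ** p
        for y, q in zip(y_samples, probs)
    )
    return total ** (1.0 / p)
```

For instance, with the unit box in the plane and three scenarios in which `y` is feasible, violates an upper bound by 1, and violates a lower bound by 1, with probabilities 0.5, 0.25, 0.25, the penalty with `p = 1` equals 0.5.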


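Example 3 Let $G(x, \omega) = B(z(x, \omega), R(x, \omega))$ be the closed ball with centre $z(x, \omega)$ and radius $R(x, \omega)$. Then, the penalty term $\varphi$ from Theorem 2 has the form

$$\varphi(x, y) = \Big( \int_\Omega \max\big\{ 0, \ |y(\omega) - z(x, \omega)| - R(x, \omega) \big\}^p \, dP(\omega) \Big)^{1/p}$$

in the case $1 < p < +\infty$.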
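
The ball case can be sketched in the same way on a finite sample space. As a simplifying assumption, the centre and radius below are constants (independent of $x$ and $\omega$) and the Euclidean norm is used; the function names are hypothetical.

```python
import math


def dist_to_ball(y, center, radius):
    """Euclidean distance from y to the closed ball B(center, radius)."""
    return max(0.0, math.dist(y, center) - radius)


def ball_penalty(y_samples, probs, center, radius, p):
    """(sum over scenarios of P(w) * dist(y(w), ball)^p)^(1/p) for finite p >= 1."""
    total = sum(
        q * dist_to_ball(y, center, radius) ** p
        for y, q in zip(y_samples, probs)
    )
    return total ** (1.0 / p)
```

The distance is zero exactly when the point lies in the ball, so the penalty vanishes iff the constraint holds in every scenario of positive probability.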
Observe that the penalty terms from Theorem 2 and the examples above depend on the parameter $p$ that defines the space in which one solves the problem $(\mathcal{P})$. This parameter must be chosen to satisfy the assumption of Theorem 2.

Under some additional assumptions on constraints of the second-stage problem, one can prove the global exactness of the penalty function $\Phi_c$ with a penalty term $\varphi$ that does not depend on $p$. For the sake of simplicity, we will prove this result only in the case when the feasible set $G(x, \omega)$ of the second-stage problem is defined by inequality constraints, i.e., it has the form

$$G(x, \omega) = \big\{ y \in \mathbb{R}^m \bigm| g_i(x, y, \omega) \le 0, \ i \in I = \{1, \ldots, \ell\} \big\}$$

for some functions $g_i\colon \mathbb{R}^d \times \mathbb{R}^m \times \Omega \to \mathbb{R}$. Below we suppose that for each $x \in \mathbb{R}^d$ the map $(y, \omega) \mapsto g_i(x, y, \omega)$, $i \in I$, is a Carathéodory function, so that the penalty term

$$\varphi(x, y) = \int_\Omega \max_{i \in I} \{ 0, g_i(x, y(\omega), \omega) \} \, dP(\omega) \tag{17}$$

is correctly defined. Note that $\varphi(x, y) = 0$ iff $(x, y) \in M$. We will assume that for any $x \in \mathbb{R}^d$ and a.e. $\omega \in \Omega$, the function $y \mapsto g_i(x, y, \omega)$, $i \in I$, is quasidifferentiable, and denote by $\mathscr{D}_y g_i(x, y, \omega) = [\underline{\partial}_y g_i(x, y, \omega), \overline{\partial}_y g_i(x, y, \omega)]$ its quasidifferential. Denote also $I(x, y, \omega) = \{ i \in I \mid g_i(x, y, \omega) = \max_{k \in I} g_k(x, y, \omega) \}$.
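
For a finite sample space, the penalty term (17) and the active index set can be computed as follows. This is only an illustrative sketch: the constraints are passed as hypothetical Python callables `g(x, y, w)`, indices are numbered from 1 as in the text, and a small tolerance is used to decide which constraints attain the maximum.

```python
def active_set(g_values, tol=1e-12):
    """Indices i (1-based) with g_i equal to max_k g_k, up to a tolerance."""
    top = max(g_values)
    return [i for i, g in enumerate(g_values, start=1) if top - g <= tol]


def penalty_17(constraints, x, y_samples, scenarios, probs):
    """Sum over scenarios of P(w) * max{0, g_1(x, y(w), w), ..., g_l(x, y(w), w)}."""
    return sum(
        q * max(0.0, *(g(x, y, w) for g in constraints))
        for y, w, q in zip(y_samples, scenarios, probs)
    )
```

Unlike the distance-based terms of Theorem 2, this quantity is an $L^1$-type integral of the constraint violation and involves no exponent $p$.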

Let $(Y, d)$ be a metric space, $K \subset Y$ be a given set, and $g\colon Y \to \mathbb{R} \cup \{+\infty\}$ be a given function. Recall that for any $y \in K \cap \operatorname{dom} g$, the quantity

$$g^{\downarrow}_K(y) = \liminf_{z \to y, \, z \in K} \frac{g(z) - g(y)}{d(z, y)}$$

is called the rate of steepest descent of $g$ at $y$. If $y$ is not a limit point of the set $K$, then by definition $g^{\downarrow}_K(y) = +\infty$. Recall also that a point $y \in K \cap \operatorname{dom} g$ is called an inf-stationary point of $g$ on the set $K$, if $g^{\downarrow}_K(y) \ge 0$. It should be noted that in various particular cases, this inequality is reduced to standard stationarity conditions. For example, if $Y$ is a normed space, $g$ is Fréchet differentiable at a point $y \in K$, and the set $K$ is convex, then $g^{\downarrow}_K(y) \ge 0$ iff $g'(y)[z - y] \ge 0$ for all $z \in K$, where $g'(y)$ is the Fréchet derivative of $g$ at $y$. See [10, 11, 43, 44] for more details on the rate of steepest descent and the definition of inf-stationarity.
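
As a quick illustration (a textbook-style example, not taken from the chapter), take $Y = K = \mathbb{R}$ with $d(z, y) = |z - y|$ and write $g^{\downarrow}_K$ for the rate of steepest descent:

```latex
\begin{align*}
g(y) = |y|:  &\quad g^{\downarrow}_K(0) = \liminf_{z \to 0} \frac{|z| - 0}{|z|} = 1 \ge 0, \\
g(y) = -|y|: &\quad g^{\downarrow}_K(0) = \liminf_{z \to 0} \frac{-|z| - 0}{|z|} = -1 < 0 .
\end{align*}
```

Thus the point $0$ is inf-stationary for the first function and not for the second, although neither function is differentiable there.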


Theorem 3 Let $1 < p < +\infty$ and the following assumptions be valid:

1. There exists a globally optimal solution of the problem $(\mathcal{P})$.

2. The functional $\mathcal{I}$ is Lipschitz continuous on bounded sets.

3. The set $S_c(\gamma) = \{ (x, y) \in X \mid x \in A, \ \Phi_c(x, y) < \gamma \}$ is bounded for some $c \ge 0$ and $\gamma > \mathcal{I}_*$, where $\Phi_c$ is the penalty function with the penalty term (17).

4. For any $x \in A$, there exists an a.e. nonnegative function $L(\cdot) \in L^{p'}(\Omega, \mathfrak{A}, P)$ such that $|g_i(x, y_1, \omega) - g_i(x, y_2, \omega)| \le L(\omega) |y_1 - y_2|$ for all $y_1, y_2 \in \mathbb{R}^m$, all $i \in I$, and a.e. $\omega \in \Omega$.

5. For all $i \in I$, $x \in A$, and $y \in L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$, the set-valued mappings $\underline{\partial}_y g_i(x, y(\cdot), \cdot)$ and $\overline{\partial}_y g_i(x, y(\cdot), \cdot)$ are measurable.

6. There exists $a > 0$ such that for any $(x, y) \in A \times \mathbb{R}^m$ and a.e. $\omega \in \Omega$ such that $y \notin G(x, \omega)$, and for all $i \in I(x, y, \omega)$, one can find $w_i(x, y, \omega) \in \overline{\partial}_y g_i(x, y, \omega)$ satisfying the following condition:

$$\operatorname{dist}\Big( 0, \ \operatorname{co}\big\{ \underline{\partial}_y g_i(x, y, \omega) + w_i(x, y, \omega) \bigm| i \in I(x, y, \omega) \big\} \Big) \ge a. \tag{18}$$

Then, the penalty function $\Phi_c$ with the penalty term (17) is globally exact, and there exists $c_* \ge 0$ such that for any $c \ge c_*$, the following statements hold true:

1. $(x_*, y_*) \in S_c(\gamma)$ is a locally optimal solution of the penalized problem (13) iff $(x_*, y_*)$ is a locally optimal solution of the problem $(\mathcal{P})$.

2. $(x_*, y_*) \in S_c(\gamma)$ is an inf-stationary point of the penalty function $\Phi_c$ on the set $A \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$ iff $(x_*, y_*)$ is an inf-stationary point of the functional $\mathcal{I}$ on the feasible set $\mathcal{F}$ of the problem $(\mathcal{P})$.


Proof Let us show that under the assumptions of the theorem, $\varphi^{\downarrow}(x, \cdot)(y) \le -a$ for any $(x, y) \in X \setminus \mathcal{F}$ such that $x \in A$ and $\varphi(x, y) < +\infty$ (here $\varphi^{\downarrow}(x, \cdot)(y)$ is the rate of steepest descent of the function $y \mapsto \varphi(x, y)$ at the point $y$). Then, applying [23, Thrm. 2], one obtains the required result.

To prove the required estimate for $\varphi^{\downarrow}(x, \cdot)(y)$, we first construct a descent direction for the function $\varphi$ using condition (18) and then obtain an upper estimate for the rate of steepest descent via the directional derivative of $\varphi$ along the constructed descent direction.

Fix any $(x, y) \in X \setminus \mathcal{F}$ such that $x \in A$ and $\varphi(x, y) < +\infty$. Recall that by the definition of quasidifferential, one has

$$Q_i(h, \omega) := [g_i(x, \cdot, \omega)]'(y(\omega), h) = \max_{v \in \underline{\partial}_y g_i(x, y(\omega), \omega)} \langle v, h \rangle + \min_{w \in \overline{\partial}_y g_i(x, y(\omega), \omega)} \langle w, h \rangle \tag{19}$$

(see Definition 3). Applying Assumption 5 and [1, Thrm. 8.2.11], one obtains that the function $Q_i$ is measurable in $\omega$ for any $h \in \mathbb{R}^m$. Moreover, since in the finite dimensional case the quasidifferential is a pair of compact convex sets, the function $Q_i$ is continuous in $h$ for a.e. $\omega \in \Omega$, i.e., $Q_i$ is a Carathéodory function.
Let us now prove that the multifunction $I(\cdot) := I(x, y(\cdot), \cdot)$, $I\colon \Omega \rightrightarrows \{1, \ldots, \ell\}$ is measurable. Indeed, by definitions for any nonempty subset $K \subset \{1, \ldots, \ell\}$, one has

$$
\begin{aligned}
I^{-1}(K) &:= \big\{ \omega \in \Omega \bigm| I(x, y(\omega), \omega) \cap K \ne \emptyset \big\} \\
&= \Big\{ \omega \in \Omega \Bigm| \max_{k \in K} g_k(x, y(\omega), \omega) = \max_{i \in I} g_i(x, y(\omega), \omega) \Big\}.
\end{aligned}
$$

This set is measurable, since the functions $g_i(x, y(\cdot), \cdot)$ are measurable due to the fact that the maps $(y, \omega) \mapsto g_i(x, y, \omega)$ are Carathéodory functions by our assumption. Thus, for any subset $K \subset \{1, \ldots, \ell\}$, the set $I^{-1}(K)$ is measurable, that is, the set-valued map $I(\cdot)$ is measurable by definition (see, e.g., [1, Def. 8.1.1]).
Introduce the set

$$E = \Big\{ \omega \in \Omega \Bigm| \max_{i \in I} g_i(x, y(\omega), \omega) > 0 \Big\}.$$

Note that the set $E$ is measurable, thanks to our assumption that the mappings $(y, \omega) \mapsto g_i(x, y, \omega)$ are Carathéodory functions. Moreover, $P(E) > 0$ due to the fact that $(x, y)$ is not a feasible point of the problem $(\mathcal{P})$.

Since the multifunction $I(\cdot)$ is measurable and $Q_i$ are Carathéodory functions, the set-valued mapping

$$H(\omega) := \Big\{ h \in \mathbb{R}^m \Bigm| |h| = 1, \ \max_{i \in I(\omega)} Q_i(h, \omega) = \min_{|z| = 1} \max_{i \in I(\omega)} Q_i(z, \omega) \Big\}, \quad \omega \in E,$$

is measurable by Aubin and Frankowska [1, Thrm. 8.2.11]. Furthermore, this multifunction obviously has closed images. Therefore, by Aubin and Frankowska [1, Thrm. 8.1.3], there exists a measurable function $h_*\colon E \to \mathbb{R}^m$ such that $h_*(\omega) \in H(\omega)$ for all $\omega \in E$. For any $\omega \in \Omega \setminus E$, define $h_*(\omega) = 0$. Then, $h_*\colon \Omega \to \mathbb{R}^m$ is a measurable function and, moreover, $\|h_*\|_p = P(E) > 0$.

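From condition (18) and the separation theorem, it follows that for any $\omega \in E$ there exists $h(\omega) \in \mathbb{R}^m$ with $|h(\omega)| = 1$ such that

$$\langle v, h(\omega) \rangle \le -a \quad \forall v \in \operatorname{co}\big\{ \underline{\partial}_y g_i(x, y(\omega), \omega) + w_i(x, y(\omega), \omega) \bigm| i \in I(\omega) \big\}.$$

Hence, with the use of (19), one obtains that $Q_i(h(\omega), \omega) \le -a$ for all $\omega \in E$ and $i \in I(\omega)$, which by the definition of $h_*$ implies that

$$\max_{i \in I(\omega)} Q_i(h_*(\omega), \omega) \ \begin{cases} \le -a, & \text{if } \omega \in E, \\ = 0, & \text{if } \omega \notin E. \end{cases} \tag{20}$$

Thus, the function $h_*$ is the desired descent direction, along which we will evaluate the directional derivative of the penalty term $\varphi$.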
Indeed, denote $\psi(\omega, \alpha) = \max_{i \in I} \{ 0, g_i(x, y(\omega) + \alpha h_*(\omega), \omega) \}$ for all $\alpha \ge 0$ and $\omega \in \Omega$. Applying relations (20) and standard calculus rules for directional derivatives (see, e.g., [14]), one gets that

$$\lim_{\alpha \to +0} \frac{\psi(\omega, \alpha) - \psi(\omega, 0)}{\alpha} = \begin{cases} \max_{i \in I(\omega)} Q_i(h_*(\omega), \omega) \le -a, & \text{if } \omega \in E, \\ 0, & \text{if } \omega \notin E. \end{cases}$$

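Applying Assumption 4 and the well-known fact that the maximum of a finite family of Lipschitz continuous functions is Lipschitz continuous (see, e.g., [13, Appendix III]), one obtains that there exists an a.e. nonnegative function $L(\cdot) \in L^{p'}(\Omega, \mathfrak{A}, P)$ such that

$$\Big| \frac{\psi(\omega, \alpha) - \psi(\omega, 0)}{\alpha} \Big| \le L(\omega) |h_*(\omega)| \le L(\omega) \quad \forall \alpha > 0, \ \text{a.e. } \omega \in \Omega.$$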


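Note also that $\psi(\cdot, 0) \in L^1(\Omega, \mathfrak{A}, P)$, since $\varphi(x, y) < +\infty$. Hence, by the inequality above, $\psi(\cdot, \alpha) \in L^1(\Omega, \mathfrak{A}, P)$ for all $\alpha \ge 0$. Consequently, applying Lebesgue's dominated convergence theorem and the fact that $\varphi(x, y + \alpha h_*) = \mathbb{E}[\psi(\cdot, \alpha)]$, one obtains that

$$[\varphi(x, \cdot)]'(y, h_*) = \lim_{\alpha \to +0} \frac{\varphi(x, y + \alpha h_*) - \varphi(x, y)}{\alpha} = \int_E \max_{i \in I(\omega)} Q_i(h_*(\omega), \omega) \, dP(\omega) \le -a P(E).$$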


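Therefore,

$$
\begin{aligned}
\varphi^{\downarrow}(x, \cdot)(y) &= \liminf_{z \to y} \frac{\varphi(x, z) - \varphi(x, y)}{\|z - y\|_p} \\
&\le \liminf_{\alpha \to +0} \frac{\varphi(x, y + \alpha h_*) - \varphi(x, y)}{\alpha \|h_*\|_p} = \frac{[\varphi(x, \cdot)]'(y, h_*)}{\|h_*\|_p} \le \frac{-a P(E)}{P(E)} = -a,
\end{aligned}
$$

and the proof is complete.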
Remark 5

(i) Note that by Rockafellar and Wets [37, Crlr. 14.14], the multifunctions $\underline{\partial}_y g_i(x, y(\cdot), \cdot)$ and $\overline{\partial}_y g_i(x, y(\cdot), \cdot)$ are measurable for any measurable function $y(\cdot)$, provided for any $\omega \in \Omega$ the mapping $\partial g_i(x, \cdot, \omega)$ is outer semicontinuous and the graphical mapping $\omega \mapsto \operatorname{Graph} \partial g_i(x, \cdot, \omega)$ is measurable.

(ii) In the case when the functions $g_i$ are continuously differentiable in $y$, assumption (18) is satisfied iff there exists $a > 0$ such that for any $(x, y) \in \mathbb{R}^{d+m}$ and a.e. $\omega \in \Omega$ such that $y \notin G(x, \omega)$, one has

$$\operatorname{dist}\Big( 0, \ \operatorname{co}\big\{ \nabla_y g_i(x, y, \omega) \bigm| i \in I(x, y, \omega) \big\} \Big) \ge a.$$

This condition can be viewed as a uniform Mangasarian-Fromovitz constraint qualification. In turn, in the case when the functions $g_i$ are convex in $y$, assumption (18) is satisfied iff there exists $a > 0$ such that for any $(x, y) \in \mathbb{R}^{d+m}$ and a.e. $\omega \in \Omega$ such that $y \notin G(x, \omega)$, one has

$$\operatorname{dist}\Big( 0, \ \operatorname{co}\big\{ \partial_y g_i(x, y, \omega) \bigm| i \in I(x, y, \omega) \big\} \Big) \ge a,$$

where $\partial_y g_i(x, y, \omega)$ is the subdifferential of the function $g_i(x, \cdot, \omega)$ in the sense of convex analysis.
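
In the smooth case with two active constraints, the quantity in this condition is the distance from the origin to a line segment, which has a closed form. The sketch below is a hypothetical helper (not from the chapter) for that special case: `u` and `v` stand for the two active gradients at a given infeasible point.

```python
import math


def dist_zero_to_segment(u, v):
    """Distance from the origin to co{u, v} = {t*u + (1 - t)*v : 0 <= t <= 1}."""
    d = [ui - vi for ui, vi in zip(u, v)]
    dd = sum(di * di for di in d)
    if dd == 0.0:
        t = 0.0  # u == v, the segment is a single point
    else:
        # Unconstrained minimizer of |v + t*d|^2, clipped to [0, 1].
        t = min(1.0, max(0.0, -sum(vi * di for vi, di in zip(v, d)) / dd))
    point = [vi + t * di for vi, di in zip(v, d)]
    return math.hypot(*point)
```

If the two gradients point in opposite directions, e.g. `(1, 0)` and `(-1, 0)`, the distance is zero and the condition fails at that point; for `(1, 0)` and `(0, 1)` the distance is $1/\sqrt{2}$. With more than two active constraints, computing the distance requires solving a small quadratic program instead.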


Remark 6 Let for a.e. $\omega \in \Omega$ the functions $(x, y) \mapsto f(x, y, \omega)$ and $(x, y) \mapsto g_i(x, y, \omega)$, $i \in I$, be DC (Difference-of-Convex), that is, there exist convex in $(x, y)$ functions $f_1(x, y, \omega)$, $f_2(x, y, \omega)$, $g_{i1}(x, y, \omega)$, and $g_{i2}(x, y, \omega)$ such that

$$f(x, y, \omega) = f_1(x, y, \omega) - f_2(x, y, \omega), \quad g_i(x, y, \omega) = g_{i1}(x, y, \omega) - g_{i2}(x, y, \omega)$$

for all $(x, y) \in \mathbb{R}^{d+m}$, $i \in I$, and a.e. $\omega \in \Omega$. Then, the penalty function from Theorem 3 is DC as well. Namely, one has $\Phi_c(x, y) = \Phi_c^1(x, y) - \Phi_c^2(x, y)$, where


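$$\Phi_c^1(x, y) = \int_\Omega \Big( f_1(x, y(\omega), \omega) + c \max_{i \in I} \Big\{ \sum_{k \in I} g_{k2}(x, y(\omega), \omega), \ g_{i1}(x, y(\omega), \omega) + \sum_{k \ne i} g_{k2}(x, y(\omega), \omega) \Big\} \Big) \, dP(\omega),$$

and

$$\Phi_c^2(x, y) = \int_\Omega \Big( f_2(x, y(\omega), \omega) + c \sum_{i \in I} g_{i2}(x, y(\omega), \omega) \Big) \, dP(\omega)$$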


are convex functionals. Therefore, with the use of Theorem 3 and well-known global optimality conditions for DC optimization problems, one can easily obtain global optimality conditions for the problem $(\mathcal{P})$ and the original two-stage stochastic programming problem (cf. [41]). Moreover, under the assumptions of Theorem 3, one can apply well-developed methods of DC optimization to find local or global minima of the DC penalty function $\Phi_c(x, y)$, which coincide with local/global minima of the problem $(\mathcal{P})$. Thus, Theorem 3 opens a way for applications of DC programming algorithms to two-stage stochastic programming problems (cf. [34, 42]).
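
The DC structure of the penalty term rests on a pointwise identity: if $g_i = g_{i1} - g_{i2}$, then $\max_i\{0, g_i\}$ equals a maximum of sums of the convex parts minus $\sum_k g_{k2}$. The sketch below checks this identity numerically at a single point; the lists `g1` and `g2` are hypothetical values of $g_{i1}$ and $g_{i2}$ there, and the function names are illustrative only.

```python
def max_violation(g1, g2):
    """max{0, g_1, ..., g_l} with g_i = g1[i] - g2[i]."""
    return max(0.0, *(a - b for a, b in zip(g1, g2)))


def dc_parts(g1, g2):
    """Return (convex_max, total2) with max_violation == convex_max - total2.

    total2 is the sum of all g_{k2}; the term a + total2 - b equals
    g_{i1} + sum over k != i of g_{k2}, i.e. a sum of convex functions.
    """
    total2 = sum(g2)
    convex_max = max(total2, *(a + total2 - b for a, b in zip(g1, g2)))
    return convex_max, total2
```

Both returned quantities are built from convex functions by sums and maxima only, so each is convex, and their difference reproduces the constraint violation.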
Let us finally derive optimality conditions for the problem $(\mathcal{P})$ in terms of codifferentials. We will derive these conditions by applying standard optimality conditions for quasidifferentiable functions to an exact penalty function for the problem $(\mathcal{P})$.

For the sake of brevity, we will consider only the case when the set $A$ is convex and obtain optimality conditions under the assumptions of Theorem 3. It should be noted that one can obtain such conditions under less restrictive assumptions on the functional $\mathcal{I}$ and the penalty function $\Phi_c$, if one considers the so-called local exactness of the penalty function instead of the global one (see [11, 20]). Moreover, one can significantly relax the assumptions on the constraints of the second-stage problem by considering the case $p = +\infty$ and utilizing the highly nonsmooth penalty term

$$\varphi(x, y) = \operatorname*{ess\,sup}_{\omega \in \Omega} \, \max_{i \in I} \{ 0, g_i(x, y(\omega), \omega) \}.$$

However, the price one has to pay for less restrictive assumptions on constraints is the reduced regularity of Lagrange multipliers (see the theorem below). Namely, in this case, one must assume that the Lagrange multipliers are just finitely additive measures.

For any convex subset $K$ of a Banach space $Y$ and any $y \in K$, denote the normal cone to the set $K$ at the point $y$ by $N_K(y) = \{ y^* \in Y^* \mid \langle y^*, z - y \rangle \le 0 \ \forall z \in K \}$.


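Theorem 4 Let $1 < p < +\infty$, the set $A$ be convex, the feasible set of the second-stage problem (12) have the form

$$G(x, \omega) = \big\{ y \in \mathbb{R}^m \bigm| g_i(x, y, \omega) \le 0, \ i \in I = \{1, \ldots, \ell\} \big\}$$

for some functions $g_i\colon \mathbb{R}^d \times \mathbb{R}^m \times \Omega \to \mathbb{R}$, the function $f$ satisfy Assumption 1, and the functions $g_i$, $i \in I$, satisfy the same assumption. Suppose also that Assumptions 1 and 3-6 of Theorem 3 are valid, and $(x_*, y_*)$ is a locally optimal solution of the problem $(\mathcal{P})$ such that $(x_*, y_*) \in S_c(\gamma)$ for some $c \ge c_*$, where $c_*$ is from Theorem 3.

Then, for any measurable selection $(0, w_x(\cdot), w_y(\cdot))$ of the set-valued mapping $\overline{d}_{x,y} f(x_*, y_*(\cdot), \cdot)$ and any measurable selections $(0, w_{xi}(\cdot), w_{yi}(\cdot))$ of the multifunctions $\overline{d}_{x,y} g_i(x_*, y_*(\cdot), \cdot)$, $i \in I$, there exist $\xi \in L^1(\Omega, \mathfrak{A}, P; \mathbb{R}^d)$ and nonnegative multipliers $\lambda_i \in L^\infty(\Omega, \mathfrak{A}, P)$, $i \in I$, such that $\mathbb{E}[\xi] \in -N_A(x_*)$, $\big\| \sum_{i=1}^{\ell} \lambda_i \big\|_\infty \le c_*$, $\lambda_i(\omega) g_i(x_*, y_*(\omega), \omega) = 0$ for a.e. $\omega \in \Omega$ and all $i \in I$, and

$$
\begin{aligned}
(0, \xi(\omega), 0) \in{} & \underline{d}_{x,y} f(x_*, y_*(\omega), \omega) + (0, w_x(\omega), w_y(\omega)) \\
& + \sum_{i=1}^{\ell} \lambda_i(\omega) \Big( \underline{d}_{x,y} g_i(x_*, y_*(\omega), \omega) + (0, w_{xi}(\omega), w_{yi}(\omega)) \Big)
\end{aligned}
$$

for a.e. $\omega \in \Omega$.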


Proof Under the assumptions of the theorem, the functional $\mathcal{F}$ is Lipschitz
continuous on bounded sets by Corollary 2. Let

$$
\varphi(x, y) = \int_\Omega \max_{i \in I} \{0, g_i(x, y(\omega), \omega)\} \, dP(\omega) \quad \forall (x, y) \in X.
$$

Then, by Theorem 3, the pair $(x_*, y_*)$ is a point of local minimum of the penalty
function $\Phi_c$ on the set $A \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$ for any $c > c_*$, where $c_*$ is from Theorem 3.
Thus, in particular, $(x_*, y_*)$ is a point of local minimum of the problem

$$
\min_{(x, y)} \ \mathcal{F}_0(x, y) = \int_\Omega f_0(x, y(\omega), \omega) \, dP(\omega) \quad \text{s.t.} \quad (x, y) \in A \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m),
$$

where $f_0 = f + c_* \max_{i \in I} \{0, g_i\}$. The function $f_0$ is codifferentiable in $(x, y)$, and
applying codifferential calculus (see, e.g., [14]), one can compute its codifferential
and verify that $f_0$ satisfies Assumption 1. Therefore, by Corollary 1, the functional
$\mathcal{F}_0$ is directionally differentiable. Applying well-known necessary conditions for a
minimum of a directionally differentiable function on a convex set (see, e.g., [14,
Lemma V.1.2]) and Corollary 1, one obtains that

$$
\mathcal{F}_0'(x_*, y_*; h_x, h_y) = \int_\Omega [f_0(\cdot, \cdot, \omega)]'(x_*, y_*(\omega); h_x, h_y(\omega)) \, dP(\omega) \ge 0
$$

for all $(h_x, h_y) \in (A - x_*) \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$. Hence, with the use of the standard
calculus rules for directional derivatives (see [14, Sect. I.3]), one gets that for all
such $(h_x, h_y)$, the following inequality holds true:

$$
\mathcal{F}_0'(x_*, y_*; h_x, h_y) = \int_\Omega \Big( [f(\cdot, \cdot, \omega)]'(x_*, y_*(\omega); h_x, h_y(\omega))
+ c_* \max_{i \in I(\omega)} [g_i(\cdot, \cdot, \omega)]'(x_*, y_*(\omega); h_x, h_y(\omega)) \Big) \, dP(\omega) \ge 0,
$$


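where $g_0(x, y, \omega) \equiv 0$ and

$$
I(\omega) = \Big\{ i \in I \cup \{0\} \ \Big| \ g_i(x_*, y_*(\omega), \omega) = \max_{j \in I} \{0, g_j(x_*, y_*(\omega), \omega)\} \Big\}.
$$

Fix a measurable selection $(0, w_x(\cdot), w_y(\cdot))$ of the set-valued map $\overline{d}_{x,y} f(x_*, y_*(\cdot), \cdot)$,
for all $i \in I$ fix any measurable selections $(0, w_{xi}(\cdot), w_{yi}(\cdot))$ of the set-valued
maps $\overline{d}_{x,y} g_i(x_*, y_*(\cdot), \cdot)$, and denote $(w_{x0}(\cdot), w_{y0}(\cdot)) \equiv 0$. Then, by the
definition of quasidifferential (Def. 3) and equality (2), one has

$$
\int_\Omega \Big( \max_{(v_x, v_y) \in \underline{\partial} f(x_*, y_*(\omega), \omega)} \big( \langle v_x + w_x(\omega), h_x \rangle + \langle v_y + w_y(\omega), h_y(\omega) \rangle \big)
+ c_* \max_{i \in I(\omega)} \max \big( \langle v_{xi} + w_{xi}(\omega), h_x \rangle + \langle v_{yi} + w_{yi}(\omega), h_y(\omega) \rangle \big) \Big) \, dP(\omega) \ge 0
$$

for all $(h_x, h_y) \in (A - x_*) \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$, where the last maximum is taken
over all $(v_{xi}, v_{yi}) \in \underline{\partial} g_i(x_*, y_*(\omega), \omega)$. Consequently, one has

$$
\int_\Omega \max_{(v_x, v_y) \in Q(\omega)} \big( \langle v_x, h_x \rangle + \langle v_y, h_y(\omega) \rangle \big) \, dP(\omega) \ge 0 \tag{21}
$$

for all $(h_x, h_y) \in (A - x_*) \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$, where

$$
Q(\omega) = \underline{\partial} f(x_*, y_*(\omega), \omega) + (w_x(\omega), w_y(\omega))
+ c_* \, \mathrm{co} \Big\{ \underline{\partial} g_i(x_*, y_*(\omega), \omega) + (w_{xi}(\omega), w_{yi}(\omega)) \ \Big| \ i \in I(\omega) \Big\}
$$

for any $\omega \in \Omega$.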

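Let us show that the multifunction $Q(\cdot)$ is measurable. Indeed, as was pointed out
in the proof of Corollary 1, Assumption 1 guarantees that the set-valued mappings
$\underline{\partial} f(x_*, y_*(\cdot), \cdot)$ and $\underline{\partial} g_i(x_*, y_*(\cdot), \cdot)$, $i \in I \cup \{0\}$, are measurable. Hence, with
the use of [37, Proposition 14.11, part (c)], one gets that the set-valued mappings
$\underline{\partial} f(x_*, y_*(\cdot), \cdot) + (w_x(\cdot), w_y(\cdot))$ and $\underline{\partial} g_i(x_*, y_*(\cdot), \cdot) + (w_{xi}(\cdot), w_{yi}(\cdot))$, $i \in I \cup \{0\}$,
are measurable as well.

Arguing in the same way as in the proof of Theorem 3, one can easily check that
the multifunction $I(\cdot)$ is measurable, which implies that the set-valued maps

$$
Q_i(\omega) := \begin{cases}
\underline{\partial} g_i(x_*, y_*(\omega), \omega) + (w_{xi}(\omega), w_{yi}(\omega)), & \text{if } i \in I(\omega), \\
\emptyset, & \text{if } i \notin I(\omega)
\end{cases}
$$

are measurable for all $i \in I \cup \{0\}$. Therefore, by Rockafellar and Wets [37,
Prp. 14.11, part (b)] and Aubin and Frankowska [1, Thrm. 8.2.2], the set-valued
map

$$
\mathrm{co} \Big( \bigcup_{i \in I \cup \{0\}} Q_i(\omega) \Big) = \mathrm{co} \Big\{ \underline{\partial} g_i(x_*, y_*(\omega), \omega) + (w_{xi}(\omega), w_{yi}(\omega)) \ \Big| \ i \in I(\omega) \Big\}
$$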

is measurable. Hence, applying [37, Prp. 14.11, part (c)], one finally gets that the
multifunction $Q(\cdot)$ is measurable.

Now, arguing in the same way as in the proof of Lemma 2 (or utilizing the
interchangeability principle; see, e.g., [37, Thrm. 14.60]), one gets that inequality
(21) is satisfied iff

$$
\max_{(v_x(\cdot), v_y(\cdot))} \int_\Omega \big( \langle v_x(\omega), h_x \rangle + \langle v_y(\omega), h_y(\omega) \rangle \big) \, dP(\omega) \ge 0 \tag{22}
$$

for all $(h_x, h_y) \in (A - x_*) \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$, where the maximum is taken
over all measurable selections of the multifunction $Q(\cdot)$ (note that at least one such
selection exists by Aubin and Frankowska [1, Thrm. 8.1.3]). From the definition
of $Q(\cdot)$ and the growth condition on the codifferentials of the functions $f$ and $g_i$
from Assumption 1, it follows that the set of all measurable selections of $Q(\cdot)$ is a
bounded subset of the space $L^1(\Omega, \mathfrak{A}, P; \mathbb{R}^d) \times L^{p'}(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$. Therefore,
inequality (22) can be rewritten as follows:

$$
\max_{(v_1, v_2) \in \mathcal{Q}(x_*, y_*)} \Big( \langle v_1, h_x \rangle + \int_\Omega \langle v_2(\omega), h_y(\omega) \rangle \, dP(\omega) \Big) \ge 0 \tag{23}
$$

for all $(h_x, h_y) \in (A - x_*) \times L^p(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$, where

$$
\mathcal{Q}(x_*, y_*) = \Big\{ (v_1, v_2) \in \mathbb{R}^d \times L^{p'}(\Omega, \mathfrak{A}, P; \mathbb{R}^m) \ \Big| \ v_1 = E[v_x], \ v_2 = v_y, \
(v_x(\cdot), v_y(\cdot)) \text{ is a measurable selection of the map } Q(\cdot) \Big\}.
$$

The set $\mathcal{Q}(x_*, y_*)$ is bounded due to the boundedness of the set of all measurable
selections of $Q(\cdot)$. Furthermore, the set $\mathcal{Q}(x_*, y_*)$ is convex and closed, since by
definition $Q(\cdot)$ has closed and convex images. Therefore, $\mathcal{Q}(x_*, y_*)$ is a weakly
compact convex subset of $\mathbb{R}^d \times L^{p'}(\Omega, \mathfrak{A}, P; \mathbb{R}^m)$. Hence, taking into account
inequality (23) and applying the separation theorem, one can easily check that

$$
\mathcal{Q}(x_*, y_*) \cap \big( (-N_A(x_*)) \times \{0\} \big) \ne \emptyset.
$$

Consequently, by the definitions of $\mathcal{Q}(x_*, y_*)$ and $Q(\cdot)$, there exists a function
$\zeta \in L^1(\Omega, \mathfrak{A}, P; \mathbb{R}^d)$ such that $E[\zeta] \in -N_A(x_*)$ and

$$
(\zeta(\omega), 0) \in \underline{\partial} f(x_*, y_*(\omega), \omega) + (w_x(\omega), w_y(\omega))
+ c_* \, \mathrm{co} \Big\{ \underline{\partial} g_i(x_*, y_*(\omega), \omega) + (w_{xi}(\omega), w_{yi}(\omega)) \ \Big| \ i \in I(\omega) \Big\} \tag{24}
$$

for a.e. $\omega \in \Omega$.

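Let $E_J = \{\omega \in \Omega \mid I(\omega) = J\}$ for any nonempty subset $J \subseteq I \cup \{0\}$. The sets $E_J$
form a partition of $\Omega$. Moreover, these sets are measurable, since the multifunction
$I(\cdot)$ is measurable.

Observe that from (24), it follows that

$$
(\zeta(\omega), 0) \in \underline{\partial} f(x_*, y_*(\omega), \omega) + (w_x(\omega), w_y(\omega))
+ c_* \, \mathrm{co} \Big\{ \underline{\partial} g_i(x_*, y_*(\omega), \omega) + (w_{xi}(\omega), w_{yi}(\omega)) \ \Big| \ i \in J \Big\}
$$

for any $\omega \in E_J$ and any nonempty $J \subseteq I \cup \{0\}$. With the use of the Filippov theorem
(see, e.g., [1, Thrm. 8.2.10]), one can readily check that the previous inclusion
implies that for any nonempty $J \subseteq I \cup \{0\}$, there exist nonnegative measurable
functions $\alpha_i^J(\cdot)$, $i \in J$, such that $\sum_{i \in J} \alpha_i^J(\omega) = 1$ and

$$
(\zeta(\omega), 0) \in \underline{\partial} f(x_*, y_*(\omega), \omega) + (w_x(\omega), w_y(\omega))
+ c_* \sum_{i \in J} \alpha_i^J(\omega) \Big( \underline{\partial} g_i(x_*, y_*(\omega), \omega) + (w_{xi}(\omega), w_{yi}(\omega)) \Big)
$$

for a.e. $\omega \in E_J$. For any $i \in I$, define

$$
\lambda_i(\omega) = \begin{cases}
c_* \alpha_i^J(\omega), & \text{if } \omega \in E_J, \ i \in J \ (\text{or, equivalently, } i \in I(\omega)), \\
0, & \text{otherwise.}
\end{cases}
$$

Observe that by definition, $\lambda_i$, $i \in I$, are nonnegative measurable functions such that
$\big\| \sum_{i \in I} \lambda_i \big\|_\infty \le c_*$, and $\lambda_i(\omega) g_i(x_*, y_*(\omega), \omega) = 0$ for a.e. $\omega \in \Omega$, since $\lambda_i(\omega) = 0$
whenever $i \notin I(\omega)$, i.e., $g_i(x_*, y_*(\omega), \omega) < 0$. Furthermore, bearing in mind the
fact that $w_{x0}(\cdot) \equiv 0$, $w_{y0}(\cdot) \equiv 0$, and $\underline{\partial} g_0(x_*, y_*(\omega), \omega) = \{0\}$, one gets that

$$
(\zeta(\omega), 0) \in \underline{\partial} f(x_*, y_*(\omega), \omega) + (w_x(\omega), w_y(\omega))
+ \sum_{i \in I} \lambda_i(\omega) \Big( \underline{\partial} g_i(x_*, y_*(\omega), \omega) + (w_{xi}(\omega), w_{yi}(\omega)) \Big)
$$

for a.e. $\omega \in \Omega$. Hence, applying equality (2), we arrive at the required result.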


Remark 7 It should be noted that with the use of the codifferential calculus one
can compute a codifferential of the function $f_0$ from the proof of the previous
theorem, apply necessary conditions for a minimum of a codifferentiable function
on a convex set [18, Thrm. 2.8] to the functional $\mathcal{F}_0$, and then directly rewrite these
conditions in terms of the problem $(\mathcal{P})$ with the use of Theorem 1 and an explicit
expression for a codifferential of $f_0$. However, one can check that this approach
leads to more cumbersome optimality conditions than the ones from the theorem
above. It is possible to verify that these conditions are equivalent, but in the author's
opinion the proof of this equivalence is more difficult than the proof of the previous
theorem. That is why we chose to present a simpler but somewhat indirect derivation
of optimality conditions for the problem $(\mathcal{P})$.

Remark 8 Note that in the case when the functions $f$ and $g_i$, $i \in I$, are
differentiable jointly in $x$ and $y$, the optimality conditions from Theorem 4 take
the following well-known form (cf. [27, 35, 40, 45, 46]). There exist nonnegative
multipliers $\lambda_i \in L^\infty(\Omega, \mathfrak{A}, P)$, $i \in I$, such that $\lambda_i(\omega) g_i(x_*, y_*(\omega), \omega) = 0$ for a.e.
$\omega \in \Omega$ and all $i \in I$, and

$$
\Big\langle E \Big[ \nabla_x f(x_*, y_*(\cdot), \cdot) + \sum_{i \in I} \lambda_i(\cdot) \nabla_x g_i(x_*, y_*(\cdot), \cdot) \Big], x - x_* \Big\rangle \ge 0 \quad \forall x \in A,
$$

$$
\nabla_y f(x_*, y_*(\omega), \omega) + \sum_{i \in I} \lambda_i(\omega) \nabla_y g_i(x_*, y_*(\omega), \omega) = 0 \quad \text{for a.e. } \omega \in \Omega.
$$


5 Conclusions 


This work was devoted to an analysis of nonsmooth two-stage stochastic
programming problems with the use of tools of constructive nonsmooth analysis
[14]. In the first part of the chapter, we analysed the co/quasidifferentiability of the
expectation of nonsmooth random integrands and obtained explicit formulae for its
co/quasidifferentials under some natural measurability and growth conditions on the
integrand and its codifferential.

In the second part of the chapter, we obtained two types of sufficient conditions
for the global exactness of a penalty function for two-stage stochastic programming
problems, reformulated as equivalent variational problems with pointwise
constraints. The first type of sufficient conditions is formulated for the penalty
term defined via the $L^p$ norm of the distance to the feasible set of the second-stage
problem, while the second type of sufficient conditions is formulated for the
penalty term that is independent of $p$ and is defined via the constraints of the
second-stage problem. Although the second type of sufficient conditions is much more
restrictive than the first one, it is more convenient for applications and derivation of
optimality conditions. Furthermore, as is pointed out in Remark 6, these conditions
open a way for the derivation of global optimality conditions and application of
DC optimization methods to two-stage stochastic programming problems, whose
second-stage problem has DC objective function and DC constraints.

Finally, in the last part of the chapter, we combined our results on codiffer- 
entiability of the expectation of nonsmooth random integrands and exact penalty 
functions to derive optimality conditions for nonsmooth two-stage stochastic pro- 
gramming problems in terms of codifferentials, involving essentially bounded 
Lagrange multipliers.


On the Expected Extinction Time
for the Adjoint Circuit Chains Associated
with a Random Walk with Jumps
in Random Environments


Chrysoula Ganatsiou 


Abstract We study appropriate expressions of the expected extinction time for the 
“adjoint” Markov chains describing uniquely a nonhomogeneous random walk with 
jumps (with step —1 or +1 or in the same position having a right elastic barrier at 0) 
through their unique representations by directed circuits and weights (circuit chains) 
in random environments.


1 Introduction


It is known that random walks are one of the most fundamental types of stochastic
models formed by successive summation of independent, identically distributed
random variables, with many applications in various domains, ranging from
decision-making in the brain to population genetics, ranking systems, dimension reduction
and feature extraction from high-dimensional data [15]. Usually, they are studied
from the Markov chain point of view, where the random mechanism of spatial
motion is determined by the given transition probabilities (probabilities of jumps)
at each state in a non-random (fixed) environment. Although they provide a
simple conventional model for describing various transport processes, in many cases
the medium where the system evolves is highly irregular due to many irregularities
(defects, fluctuations, etc.) known as random environments, so that the local
characteristics of the motion are chosen at random according to a certain probability
distribution. Such models are referred to as random walks in random environments.
The definition of these random walks involves two special ingredients: the
environment (randomly chosen but still fixed throughout the time evolution) and the random
walk (whose transition probabilities are determined by the environment) [8].

Additionally, in various applications, we are led to study Markov chains obtained
by restricting the motion of a “particle” which performs a random walk by
introducing barriers. In this case, the corresponding Markov chain, which no longer
has independent increments, is called a random walk with barriers, while its state space
is a proper subset of $\mathbb{Z}$. Furthermore, there is a class of random walks formed
by successive summation of independent random variables that are no longer
identically distributed, known as nonhomogeneous random walks (in opposition
to the homogeneous random walks with independent and identically distributed
increments). They can be investigated from the Markov chain point of view, which
in general coincides with that for chains with independent increments. We will
consider the nonhomogeneous random walk with state space $S = \mathbb{N}$, right elastic
barrier at 0 and transition probabilities given by $p_{ij} = 0$ if $|i - j| > 1$, $p_{i,i-1} = q_i$,
$p_{ii} = r_i$, $p_{i,i+1} = p_i$, $p_i + q_i + r_i = 1$, $i \ge 1$, $p_{00} = r_0$, $p_{01} = p_0 = 1 - r_0$,
$p_i > 0$, $q_{i+1} > 0$, $r_i \ge 0$, $i \ge 0$, which expresses the movement of a particle
depending on the time that the particle begins to move.

It is also known that extinction means the termination of a system. The moment
of extinction is generally the time when the system enters the absorbing state
0, although the capacity to recover may have been lost before this point. Since
the states that the system enters before the extinction may be too many, the
determination of this moment is difficult and is usually done retrospectively [7, 14].
A central question in this context is the probability of ultimate extinction, where
no systems exist after some finite number of transitions. In this direction, the
determination of a measure known as the “expected time to extinction”, also
called “average extinction time” or “mean survival time”, which possesses logical
properties, is fundamental. The expected time to extinction of a system is
intimately related to the probability of its occurrence in such a way as to ensure
the validity of certain common inference patterns found in optimization and similar
types of evolutionary reasoning [12, 13].

In parallel, in recent years, systematic research has been developed (Kalpazidou
[9], MacQueen [10], Qian Minping and Qian Min [11], Zemanian [16] and others) in
order to investigate representations of the finite-dimensional distributions of Markov
processes (with discrete or continuous parameter) having an invariant measure, as
decompositions in terms of the circuit passage functions

$$
J_c(i, j) = \begin{cases}
1, & \text{if } i, j \text{ are consecutive states of } c, \\
0, & \text{otherwise}
\end{cases}
$$

for any directed sequence $c = (i_1, i_2, \dots, i_v, i_1)$ of states called a circuit, $v \ge 1$,
of the corresponding Markov process. This research has motivated
the representation of Markov processes through directed circuits and
weights in terms of circuit passage functions in fixed or random environments
as well as the study of specific problems associated with Markov processes in
a different way. The representations are called circuit representations, while the
corresponding discrete parameter Markov chains generated by directed weighted
circuits are called circuit chains. More specifically, let $S$ be a denumerable set. The
directed sequence $c = (i_1, i_2, \dots, i_v, i_1)$ modulo the cyclic permutations, where
$i_1, i_2, \dots, i_v \in S$, $v \ge 1$, completely defines a directed circuit in $S$. A directed
circuit may be considered as $c = (c(m), c(m+1), \dots, c(m+v-1), c(m+v))$
if there exists an $m \in \mathbb{Z}$ such that $i_1 = c(m+0)$, $i_2 = c(m+1)$, ..., $i_v =
c(m+v-1)$, $i_1 = c(m+v)$, that is, a periodic function from $\mathbb{Z}$ to $S$. The values
$c(k)$ are the points of $c$, while the directed pairs $(c(k), c(k+1))$, $k \in \mathbb{Z}$, are the
directed edges of $c$. The smallest integer $p = p(c) \ge 1$ satisfying the equation
$c(m+p) = c(m)$, for all $m \in \mathbb{Z}$, is the period of $c$. A directed circuit $c$ such that
$p(c) = 1$ is called a loop. (In the present work, we shall use directed circuits with
distinct point elements.)
Let $c$ be a directed circuit with period $p(c) \ge 1$. Then, we may define by

$$
J_c^{(n)}(i, j) = \begin{cases}
1, & \text{if there exists an } m \in \mathbb{Z} \text{ such that } i = c(m), \ j = c(m+n), \\
0, & \text{otherwise}
\end{cases}
$$

the $n$-step passage function associated with the directed circuit $c$, for any $i, j \in
S$, $n \ge 1$. We may also define

$$
J_c(i) = \begin{cases}
1, & \text{if there exists an } m \in \mathbb{Z} \text{ such that } i = c(m), \\
0, & \text{otherwise,}
\end{cases}
$$

the passage function associated with the directed circuit $c$, for any $i \in S$. The above
definitions are due to MacQueen [10] and Kalpazidou [9].
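
As an illustration of the two definitions, the following minimal Python sketch evaluates the passage functions for a circuit stored as the tuple of its distinct points (the repeated terminal point is omitted). The function names `passage` and `on_circuit` and this tuple encoding are assumptions made here for illustration only; they are not taken from [9, 10].

```python
def passage(circuit, i, j, n=1):
    """n-step passage function J_c^(n)(i, j).

    `circuit` lists the distinct points of c once, e.g. (1, 2, 3) stands for
    the directed circuit (1, 2, 3, 1); its length is the period p(c).
    """
    period = len(circuit)
    for m in range(period):
        # c is periodic, so c(m + n) is read modulo the period
        if circuit[m] == i and circuit[(m + n) % period] == j:
            return 1
    return 0


def on_circuit(circuit, i):
    """Passage function J_c(i): 1 if i is a point of the circuit, else 0."""
    return 1 if i in circuit else 0


c = (1, 2, 3)   # the circuit (1, 2, 3, 1), period 3
loop = (5,)     # the loop (5, 5), period 1
print(passage(c, 1, 2), passage(c, 2, 1), passage(c, 3, 1))
print(passage(c, 1, 3, n=2), passage(loop, 5, 5), on_circuit(c, 4))
```

With this encoding, a pair $(i, j)$ is a directed edge of $c$ exactly when `passage(c, i, j)` returns 1, so the direction of the circuit matters: the pair (1, 2) is an edge of (1, 2, 3, 1) while (2, 1) is not.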

Consider a denumerable set $S$ and an infinite denumerable class $C$ of overlapping
directed circuits with distinct points (except for the terminals) in $S$ such that all the
points of $S$ can be reached from one another following paths of circuit edges, that is,
for each two distinct points $i$ and $j$ of $S$, there exists a finite sequence $c_1, c_2, \dots, c_k$,
$k \ge 1$, of circuits of $C$ such that $i$ lies on $c_1$ and $j$ lies on $c_k$ and any pair of
consecutive circuits $(c_n, c_{n+1})$ have at least one point in common. We may assume
also that the class $C$ contains, among its elements, circuits with period greater than
or equal to 2. With each directed circuit $c \in C$, let us associate a strictly positive weight $w_c$
which must be independent of the choice of the representative of $c$, that is, it must
satisfy the consistency condition $w_{c \circ t_k} = w_c$, $k \in \mathbb{Z}$, where $t_k$ is the translation of
length $k$.

For a given class $C$ of overlapping directed circuits and for a given sequence
$(w_c)_{c \in C}$ of weights, we may define by

$$
p_{ij} = \frac{\sum_{c \in C} w_c \cdot J_c^{(1)}(i, j)}{\sum_{c \in C} w_c \cdot J_c(i)} \tag{1}
$$

the elements of a Markov transition matrix on $S$, if and only if $\sum_{c \in C} w_c \cdot J_c(i) < \infty$
for any $i \in S$. This means that a given Markov transition matrix $P = (p_{ij})$, $i, j \in S$,
can be represented by directed circuits and weights if and only if there exist a
class of overlapping directed circuits $C$ and a sequence of positive weights $(w_c)_{c \in C}$
such that the formula (1) holds. In this case, the representation of the distribution
of a Markov process (with discrete or continuous parameter) having an invariant
measure as a decomposition in terms of the circuit passage functions is called a circuit
representation, while the corresponding discrete parameter Markov chain generated
by directed circuits is called a circuit chain with Markov transition matrix $P$ given by
(1) and unique stationary distribution $p$ (a solution of $p \cdot P = p$) defined by

$$
p(i) = \sum_{c \in C} w_c \cdot J_c(i), \quad i \in S.
$$

It is known that the following classes of Markov chains may be represented uniquely
by directed circuits and weights: (i) the recurrent Markov chains [11] and (ii) the
reversible Markov chains.
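
Formula (1) can be checked on a small example. The sketch below is only an illustration under simplifying assumptions: it uses a finite state set and a finite class of circuits (the text works with denumerable ones), the circuits and weights are hypothetical, and the helper names `has_edge` and `transition_matrix` are introduced here. It builds $P$ from (1) and verifies that $p(i) = \sum_c w_c J_c(i)$ satisfies $p \cdot P = p$.

```python
from fractions import Fraction


def has_edge(circuit, i, j):
    """J_c^(1)(i, j): 1 if (i, j) is a directed edge of the circuit, else 0."""
    v = len(circuit)
    return int(any(circuit[m] == i and circuit[(m + 1) % v] == j for m in range(v)))


def transition_matrix(states, weighted_circuits):
    """Formula (1) on a finite state set.

    weighted_circuits is a list of (circuit, weight) pairs, each circuit being
    the tuple of its distinct points.
    """
    P = {}
    for i in states:
        total = sum(w for c, w in weighted_circuits if i in c)   # sum_c w_c J_c(i)
        for j in states:
            through = sum(w * has_edge(c, i, j) for c, w in weighted_circuits)
            P[i, j] = Fraction(through, total)
    return P


# Hypothetical data: circuits (0,1,0) and (1,2,1) plus a loop at every state.
circuits = [((0, 1), 2), ((1, 2), 3), ((0,), 1), ((1,), 1), ((2,), 2)]
states = [0, 1, 2]
P = transition_matrix(states, circuits)
mass = {i: sum(w for c, w in circuits if i in c) for i in states}

# Each row of P sums to 1, and the (unnormalised) measure p is invariant.
rows_ok = all(sum(P[i, j] for j in states) == 1 for i in states)
invariant = all(sum(mass[i] * P[i, j] for i in states) == mass[j] for j in states)
print(rows_ok, invariant)
```

Exact rational arithmetic is used so that the invariance check is an equality rather than a floating-point comparison. The measure `mass` is left unnormalised; dividing by its total gives a probability distribution in the finite case.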

Taking into account the importance of the study of extinction for different classes
of processes and following the theory of circuit representations of
Markov processes, the present work is an attempt to
find suitable expressions of the expected extinction time, that is, of the mean first
passage time to the state 0 starting at state $k$, $k \in \mathbb{Z}$, for the corresponding “adjoint”
Markov chains (circuit chains) that uniquely describe, by directed circuits and weights,
a nonhomogeneous random walk with jumps (with step -1 or +1 or in the same
position having a right elastic barrier at 0) in random environments. Since for any
such Markov chain with an absorbing state, the extinction usually occurs as the
time $t \to \infty$ with unit probability, this will give a new perspective in the study of
problems associated with random walks. For an analogous study of the discrete-time
birth-death circuit chains as a special case, see Ganatsiou [3, 5, 6].

The work is organized as follows. In Sect. 2, we present some auxiliary results in
order to make the presentation of the chapter more comprehensible. In particular,
in Sect. 2, a nonhomogeneous random walk with jumps (with step -1 or +1
or in the same position having one right elastic barrier at 0) is considered, and
the unique representations by directed circuits and weights of the corresponding
adjoint Markov chains (circuit chains) are studied in fixed and random environments.
These representations make it possible to find suitable expressions of the
expected extinction time, that is, of the mean first passage time to the origin, for the
abovementioned circuit chains in random environments, as described in Sect. 3.
Throughout the chapter, we shall need the following notations:

$$
\mathbb{N} = \{0, 1, 2, \dots\}, \quad \mathbb{N}^* = \{1, 2, \dots\}, \quad \mathbb{Z} = \{\dots, -1, 0, 1, \dots\},
$$
$$
\mathbb{Z}_+^* = \{1, 2, 3, \dots\}, \quad \mathbb{Z}_-^* = \{\dots, -2, -1\}.
$$

2 Circuit and Weight Representations of the Adjoint Circuit 
Chains Associated with a Random Walk with Jumps in 
Fixed, Random Environments 


2.1 Fixed Environments 


Let us consider the Markov chain $(X_n)_{n \in \mathbb{N}}$ on $\mathbb{N}$ ($X_n$ expresses the location of a
walker at time $n$, $n \in \mathbb{N}$), which describes the nonhomogeneous random walk with
jumps having a right elastic barrier at 0, with transitions $k \to (k+1)$, $k \to (k-1)$
and $k \to k$, in a fixed environment, whose elements of the corresponding Markov
transition matrix (transition probabilities) are defined by

$$
\begin{aligned}
P(X_{n+1} = 0 / X_n = 0) &= r_0, \\
P(X_{n+1} = 1 / X_n = 0) &= p_0, \quad p_0 = 1 - r_0, \\
P(X_{n+1} = k+1 / X_n = k) &= p_k, \quad k \ge 1, \\
P(X_{n+1} = k / X_n = k) &= r_k, \quad k \ge 1, \\
P(X_{n+1} = k-1 / X_n = k) &= q_k, \quad k \ge 1,
\end{aligned}
$$

such that $p_k + q_k + r_k = 1$, $p_k > 0$, $q_{k+1} > 0$, $r_k \ge 0$, for every $k \in \mathbb{N}$, as
shown in Fig. 1.

Fig. 1 The Markov chain $(X_n)_{n \in \mathbb{N}}$ (fixed environments)

Assume that $(p_k)_{k \in \mathbb{N}}$ and $(r_k)_{k \in \mathbb{N}}$ are arbitrary fixed sequences with $0 < p_0 =
1 - r_0 < 1$, $p_k > 0$, $q_{k+1} > 0$, $r_k \ge 0$, for every $k \in \mathbb{N}$. If we consider the
directed circuits $c_k = (k, k+1, k)$, $c'_k = (k, k)$, $k \in \mathbb{N}$, and the collections of
weights $(w_{c_k})_{k \in \mathbb{N}}$ and $(w_{c'_k})_{k \in \mathbb{N}}$, respectively, then we may obtain the corresponding
transition probabilities

$$
p_k = \frac{w_{c_k}}{w_{c_{k-1}} + w_{c_k} + w_{c'_k}}, \quad \text{with} \quad p_0 = \frac{w_{c_0}}{w_{c_0} + w_{c'_0}},
$$

$$
q_k = \frac{w_{c_{k-1}}}{w_{c_{k-1}} + w_{c_k} + w_{c'_k}}, \quad r_k = \frac{w_{c'_k}}{w_{c_{k-1}} + w_{c_k} + w_{c'_k}},
$$

such that $p_k + q_k + r_k = 1$, for every $k \ge 1$, with $r_0 = 1 - p_0 = \frac{w_{c'_0}}{w_{c_0} + w_{c'_0}}$.
Here, the class $C(k)$ contains the directed circuits $c_k = (k, k+1, k)$, $c'_k = (k, k)$ and
$c_{k-1} = (k-1, k, k-1)$.
Equivalently, the transition matrix $P = (p_{ij})$ with

$$
p_{ij} = \frac{\sum_{k \in \mathbb{N}} w_{c_k} \cdot J^{(1)}_{c_k}(i, j)}{\sum_{k \in \mathbb{N}} \big[ w_{c_k} \cdot J_{c_k}(i) + w_{c'_k} \cdot J_{c'_k}(i) \big]}, \quad \text{for } i \ne j, \tag{2}
$$

$$
p_{ii} = \frac{\sum_{k \in \mathbb{N}} w_{c'_k} \cdot J^{(1)}_{c'_k}(i, i)}{\sum_{k \in \mathbb{N}} \big[ w_{c_k} \cdot J_{c_k}(i) + w_{c'_k} \cdot J_{c'_k}(i) \big]}, \tag{3}
$$

where $J^{(1)}_{c_k}(i, j) = 1$ if $i$ and $j$ are consecutive points of the circuit $c_k$, $J_{c_k}(i) = 1$ if
$i$ is a point of the circuit $c_k$ and $J_{c'_k}(i) = 1$ if $i$ is a point of the circuit $c'_k$, expresses
the representation of the Markov chain $(X_n)_{n \in \mathbb{N}}$ by directed circuits and weights.
Furthermore, let us consider also the “adjoint” Markov chain $(X'_n)_{n \in \mathbb{N}}$ on $\mathbb{N}$
whose elements of the corresponding Markov transition matrix are defined by

$$
\begin{aligned}
P(X'_{n+1} = 0 / X'_n = 0) &= r'_0, \\
P(X'_{n+1} = 1 / X'_n = 0) &= q'_0 = 1 - r'_0, \\
P(X'_{n+1} = k-1 / X'_n = k) &= p'_k, \quad k \ge 1, \\
P(X'_{n+1} = k / X'_n = k) &= r'_k, \quad k \ge 1, \\
P(X'_{n+1} = k+1 / X'_n = k) &= q'_k, \quad k \ge 1,
\end{aligned}
$$

such that $p'_k + q'_k + r'_k = 1$, $p'_{k+1} > 0$, $q'_k > 0$, $r'_k > 0$, for every $k \in \mathbb{N}$, as
shown in Fig. 2.

Fig. 2 The “adjoint” Markov chain $(X'_n)_{n \in \mathbb{N}}$ (fixed environments)

Assume that $(q'_k)_{k \in \mathbb{N}}$ and $(r'_k)_{k \in \mathbb{N}}$ are arbitrary fixed sequences with
$0 < q'_0 = 1 - r'_0 < 1$, $p'_{k+1}, q'_k, r'_k > 0$, for every $k \in \mathbb{N}$. If we consider the directed
circuits $c''_k = (k+1, k, k+1)$, $c'''_k = (k, k)$, $k \in \mathbb{N}$, and the collections of weights
$(w_{c''_k})_{k \in \mathbb{N}}$, $(w_{c'''_k})_{k \in \mathbb{N}}$, respectively, then we may have that

$$
q'_k = \frac{w_{c''_k}}{w_{c''_{k-1}} + w_{c''_k} + w_{c'''_k}}, \quad \text{with} \quad q'_0 = \frac{w_{c''_0}}{w_{c''_0} + w_{c'''_0}},
$$

and

$$
p'_k = \frac{w_{c''_{k-1}}}{w_{c''_{k-1}} + w_{c''_k} + w_{c'''_k}}, \quad r'_k = \frac{w_{c'''_k}}{w_{c''_{k-1}} + w_{c''_k} + w_{c'''_k}},
$$

such that $p'_k + q'_k + r'_k = 1$, for every $k \ge 1$, with $r'_0 = 1 - q'_0 = \frac{w_{c'''_0}}{w_{c''_0} + w_{c'''_0}}$.

Here, the class $C'(k)$ contains the directed circuits $c''_k = (k+1, k, k+1)$, $c''_{k-1} =
(k, k-1, k)$ and $c'''_k = (k, k)$. As a consequence, the transition matrix $P' = (p'_{ij})$
with elements equivalent to those given by the abovementioned formulae (2) and
(3) expresses also the representation of the adjoint Markov chain $(X'_n)_{n \in \mathbb{N}}$ by
directed circuits and weights. Consequently, we have the following:


Proposition 1 The Markov chain $(X_n)_{n \in \mathbb{N}}$ defined as above has a unique
representation by directed circuits and weights.


Proposition 2 The “adjoint” Markov chain $(X'_n)_{n \in \mathbb{N}}$ defined as above has a unique
representation by directed circuits and weights.


For the proofs of the above propositions, see Ganatsiou [1, 2, 4].


2.2 Random Environments


Let us consider the random walk on $\mathbb{Z}$ with transitions $k \to (k-1)$, $k \to (k+1)$
and $k \to k$ whose transition probabilities $(p_k)_{k \in \mathbb{Z}}$ and $(r_k)_{k \in \mathbb{Z}}$ constitute stationary
ergodic sequences. A realization of these stationary ergodic sequences is called
a random environment for this random walk. In order to investigate the unique
circuit and weight representation of this random walk in random environments, for
almost every environment, let us consider a probability space $(\Omega, \mathcal{F}, \mu)$, a measure
preserving ergodic automorphism of this space $m \colon \Omega \to \Omega$ and the measurable
functions $p \colon \Omega \to (0, 1)$, $r \colon \Omega \to (0, 1)$ such that every $\omega \in \Omega$ generates the
random environment $p_k \equiv p(m^k \omega)$, $r_k \equiv r(m^k \omega)$, $k \in \mathbb{Z}$. Since $m$ is measure
preserving and ergodic, the sequences $(p_k)_{k \in \mathbb{Z}}$ and $(r_k)_{k \in \mathbb{Z}}$ are stationary ergodic
sequences of random variables.

Let also $S = \mathbb{Z}^{\mathbb{N}}$ be the infinite product space with coordinates $(X_n)_{n \in \mathbb{N}}$. Then,
we may define a family $(P^\omega)_{\omega \in \Omega}$ of probability measures such that, for every
$\omega \in \Omega$, the sequence $(X_n)_{n \in \mathbb{N}}$ forms a Markov chain on $\mathbb{Z}$ whose elements of
the corresponding Markov transition matrix are defined by

$$
\begin{aligned}
P^\omega(X_0 = 0) &= 1, \\
P^\omega(X_{n+1} = k+1 / X_n = k) &= p(m^k \omega), \\
P^\omega(X_{n+1} = k / X_n = k) &= r(m^k \omega), \\
P^\omega(X_{n+1} = k-1 / X_n = k) &= 1 - p(m^k \omega) - r(m^k \omega) = q(m^k \omega), \quad k \in \mathbb{Z},
\end{aligned}
$$

as shown in Fig. 3. We have the following (Ganatsiou [4]):


Proposition 3 For $\mu$ almost every environment $\omega \in \Omega$, the chain $(X_n)_{n \in \mathbb{N}}$ has a
unique circuit and weight representation.


Proof Following an approach analogous to that of Sect. 2.1, let us consider the set
of directed circuits $c_k = (k, k+1, k)$ and $c'_k = (k, k)$, for every $k \in \mathbb{Z}$, since only
the transitions from $k$ to $k+1$, $k$ to $k-1$ and $k$ to $k$ are possible. There are three
circuits through each point $k \in \mathbb{Z}$: $c_{k-1}$, $c_k$ and $c'_k$.

Fig. 3 The Markov chain $(X_n)_{n \in \mathbb{N}}$ (random environments)

The problem we have to manage is the definition of the weights of the circuits.
We denote by $w_k(\omega)$ the weight of the circuit $c_k$ and by $w'_k(\omega)$ the weight
of the circuit $c'_k$, for every $k \in \mathbb{Z}$. For the definition of weights, let us consider the
sequences $(b_k(\omega))_{k \in \mathbb{Z}}$ and $(\gamma_k(\omega))_{k \in \mathbb{Z}}$ defined by

$$
b_k(\omega) = \frac{w_k(\omega)}{w_{k-1}(\omega)}, \quad \gamma_k(\omega) = \frac{w'_k(\omega)}{w'_{k-1}(\omega)}, \quad k \in \mathbb{Z}.
$$

As a consequence, we have

$$
b_k(\omega) = \frac{p(m^k \omega)}{1 - p(m^k \omega) - r(m^k \omega)} = \frac{p(m^k \omega)}{q(m^k \omega)} = \frac{p}{q}(m^k \omega), \tag{4}
$$

$$
\gamma_k(\omega) = \frac{r(m^k \omega) \cdot p(m^{k-1} \omega)}{r(m^{k-1} \omega) \cdot p(m^k \omega)} \cdot b_k(\omega), \quad \text{for every } k \in \mathbb{Z}. \tag{5}
$$

Given the stationary ergodic sequences $(p_k)_{k \in \mathbb{Z}}$ and $(r_k)_{k \in \mathbb{Z}}$, for which every
$\omega \in \Omega$ generates the random environment $p_k \equiv p(m^k \omega)$, $r_k \equiv r(m^k \omega)$, $k \in \mathbb{Z}$, the
preceding Eqs. (4) and (5) give a unique definition of the sequences
$(b_k(\omega))_{k \in \mathbb{Z}}$, $(\gamma_k(\omega))_{k \in \mathbb{Z}}$, for $\mu$ almost every $\omega$, by the ergodicity of $m$. Then, the
sequences of weights $(w_k(\omega))_{k \in \mathbb{Z}}$ and $(w'_k(\omega))_{k \in \mathbb{Z}}$ are defined uniquely by

$$
w_k(\omega) = w_0(\omega) \cdot b_1(\omega) \cdot b_2(\omega) \cdots b_k(\omega), \quad k \in \mathbb{Z}_+^*,
$$

$$
w_k(\omega) = \frac{w_0(\omega)}{b_0(\omega) \cdot b_{-1}(\omega) \cdot b_{-2}(\omega) \cdots b_{k+1}(\omega)}, \quad k \in \mathbb{Z}_-^*,
$$

and

$$
w'_k(\omega) = w'_0(\omega) \cdot \gamma_1(\omega) \cdot \gamma_2(\omega) \cdots \gamma_k(\omega), \quad k \in \mathbb{Z}_+^*,
$$

$$
w'_k(\omega) = \frac{w'_0(\omega)}{\gamma_0(\omega) \cdot \gamma_{-1}(\omega) \cdot \gamma_{-2}(\omega) \cdots \gamma_{k+1}(\omega)}, \quad k \in \mathbb{Z}_-^*
$$

(the uniqueness of the weight sequences $(w_k(\omega))_{k \in \mathbb{Z}}$ and $(w'_k(\omega))_{k \in \mathbb{Z}}$ is understood up
to the constant factors $w_0(\omega)$ and $w'_0(\omega)$).
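
The construction in the proof can be traced numerically on a finite window of sites. The following Python sketch is an illustration only: the environment is a hypothetical deterministic sequence standing in for one realization $p(m^k \omega)$, $r(m^k \omega)$, the function name `circuit_weights` is introduced here, and the normalisation $w'_0 = w_0 \, r_0 / p_0$ is an assumption chosen so that the loop weight at 0 is compatible with $r(m^0 \omega)$. It computes $b_k$ and $\gamma_k$ from Eqs. (4) and (5), builds the weights by the product formulas, and checks that they reproduce the transition probabilities.

```python
from fractions import Fraction


def circuit_weights(p, r, K, w0=Fraction(1)):
    """Weights w_k (circuits c_k) and w'_k (loops c'_k) for k = -K..K.

    p and r map each site k in -K..K to p(m^k w) and r(m^k w) for one fixed
    environment; q = 1 - p - r is the probability of a step to the left.
    """
    q = {k: 1 - p[k] - r[k] for k in p}
    b = {k: p[k] / q[k] for k in p}                                   # Eq. (4)
    g = {k: r[k] * p[k - 1] / (r[k - 1] * p[k]) * b[k]
         for k in p if k - 1 in p}                                    # Eq. (5)
    w = {0: w0}
    w_loop = {0: w0 * r[0] / p[0]}   # assumed normalisation of w'_0
    for k in range(1, K + 1):        # k in Z_+^*: multiply by b_k, gamma_k
        w[k] = w[k - 1] * b[k]
        w_loop[k] = w_loop[k - 1] * g[k]
    for k in range(-1, -K - 1, -1):  # k in Z_-^*: divide by b_{k+1}, gamma_{k+1}
        w[k] = w[k + 1] / b[k + 1]
        w_loop[k] = w_loop[k + 1] / g[k + 1]
    return w, w_loop


K = 4
# Hypothetical environment: p alternates between 1/3 and 1/4, r is constant.
p = {k: Fraction(1, 3 + k % 2) for k in range(-K, K + 1)}
r = {k: Fraction(1, 4) for k in range(-K, K + 1)}
w, w_loop = circuit_weights(p, r, K)

# The weights reproduce the transition probabilities of the chain.
p_ok = all(p[k] == w[k] / (w[k - 1] + w[k] + w_loop[k]) for k in range(-K + 1, K + 1))
r_ok = all(r[k] == w_loop[k] / (w[k - 1] + w[k] + w_loop[k]) for k in range(-K + 1, K + 1))
print(p_ok, r_ok)
```

Rescaling `w0` rescales every weight by the same factor and leaves the ratios unchanged, which is the sense in which the weights are unique up to a constant factor.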


Let us now introduce the “adjoint” random walk in random environment
$(X'_n)_{n \in \mathbb{N}}$. For every $\omega \in \Omega$ and for the family $(P^\omega)_{\omega \in \Omega}$ of probability measures,
the sequence $(X'_n)_{n \in \mathbb{N}}$ is a Markov chain on $\mathbb{Z}$ whose elements of the corresponding
Markov transition matrix are defined by

$$
\begin{aligned}
P^\omega(X'_0 = 0) &= 1, \\
P^\omega(X'_{n+1} = k-1 / X'_n = k) &= p(m^k \omega), \\
P^\omega(X'_{n+1} = k / X'_n = k) &= r(m^k \omega), \\
P^\omega(X'_{n+1} = k+1 / X'_n = k) &= 1 - p(m^k \omega) - r(m^k \omega) = q(m^k \omega), \quad k \in \mathbb{Z},
\end{aligned}
$$

as shown in Fig. 4. So we have the following (Ganatsiou [4]):

Fig. 4 The adjoint Markov chain $(X'_n)_{n \in \mathbb{N}}$ (random environments)


Proposition 4 For $\mu$ almost every environment $\omega \in \Omega$, the chain $(X'_n)_{n \in \mathbb{N}}$ has a
unique circuit and weight representation.


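Proof As in Proposition 3, the problem we have to manage here is the definition
of the weights of the circuits. To this end, we denote by $w''_k(\omega)$ the weight of
the circuit $c''_k = (k+1, k, k+1)$ and by $w'''_k(\omega)$ the weight of the circuit $c'''_k = (k, k)$,
for every $k \in \mathbb{Z}$. Proceeding as before for the chain
$(X_n)_{n \in \mathbb{N}}$, let us consider the sequences $(\ell_k(\omega))_{k \in \mathbb{Z}}$ and $(v_k(\omega))_{k \in \mathbb{Z}}$ defined by

$$
\ell_k(\omega) = \frac{w''_{k-1}(\omega)}{w''_k(\omega)}, \quad v_k(\omega) = \frac{w'''_{k-1}(\omega)}{w'''_k(\omega)},
$$

such that

$$
\ell_k(\omega) = \frac{p(m^k \omega)}{1 - p(m^k \omega) - r(m^k \omega)} = \frac{p(m^k \omega)}{q(m^k \omega)} = \frac{p}{q}(m^k \omega), \tag{6}
$$

$$
v_k(\omega) = \frac{r(m^{k-1} \omega)}{r(m^k \omega)} \cdot \frac{1 - p(m^k \omega) - r(m^k \omega)}{1 - p(m^{k-1} \omega) - r(m^{k-1} \omega)} \cdot \ell_k(\omega), \tag{7}
$$

for every $k \in \mathbb{Z}$.
Then, the sequences of weights $(w''_k(\omega))_{k \in \mathbb{Z}}$ and $(w'''_k(\omega))_{k \in \mathbb{Z}}$ are defined
uniquely by

$$
w''_k(\omega) = \frac{w''_0(\omega)}{\ell_1(\omega) \cdot \ell_2(\omega) \cdot \ell_3(\omega) \cdots \ell_k(\omega)}, \quad k \in \mathbb{Z}_+^*,
$$

$$
w''_k(\omega) = w''_0(\omega) \cdot \ell_0(\omega) \cdot \ell_{-1}(\omega) \cdot \ell_{-2}(\omega) \cdots \ell_{k+3}(\omega) \cdot \ell_{k+2}(\omega) \cdot \ell_{k+1}(\omega), \quad k \in \mathbb{Z}_-^*,
$$

and

$$
w'''_k(\omega) = \frac{w'''_0(\omega)}{v_1(\omega) \cdot v_2(\omega) \cdots v_k(\omega)}, \quad k \in \mathbb{Z}_+^*,
$$

$$
w'''_k(\omega) = w'''_0(\omega) \cdot v_0(\omega) \cdot v_{-1}(\omega) \cdot v_{-2}(\omega) \cdots v_{k+3}(\omega) \cdot v_{k+2}(\omega) \cdot v_{k+1}(\omega), \quad k \in \mathbb{Z}_-^*
$$

(the uniqueness of the weight sequences $(w''_k(\omega))_{k \in \mathbb{Z}}$ and $(w'''_k(\omega))_{k \in \mathbb{Z}}$ is understood up
to the constant factors $w''_0(\omega)$ and $w'''_0(\omega)$).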


3 The Expected Extinction Time for the Adjoint Circuit 
Chains Associated with a Random Walk with Jumps in 
Random Environments 


3.1 For the Circuit Chain $(X_n)_{n \in \mathbb{N}}$


We distinguish the following cases regarding $k \in \mathbb{Z}$.


3.1.1 $k \in \mathbb{Z}_+^*$


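Let us consider that the state 0 is a recurrent, absorbing state, that is, $p(m^0 \omega) = 0$ and
$q(m^0 \omega) = 0$, $\omega \in \Omega$. This means that the system will become extinct at some point,
and it cannot recover once it has become extinct (Fig. 3). Let also $t_k(\omega)$, $\omega \in \Omega$, be
the expected time before the system enters the state 0, given that the initial
state is the state $k$, $k \in \mathbb{Z}$. We have that $t_0(\omega) = 0$, $p(m^0 \omega) = 0$, $q(m^0 \omega) = 0$,
$p(m^k \omega) + r(m^k \omega) + q(m^k \omega) = 1$, $k \in \mathbb{Z}_+^*$, $\omega \in \Omega$. Then we have

$$
t_k(\omega) = p(m^k \omega) [1 + t_{k+1}(\omega)] + q(m^k \omega) [1 + t_{k-1}(\omega)] + [1 - q(m^k \omega) - p(m^k \omega)] [1 + t_k(\omega)]
$$

or

$$
t_{k+1}(\omega) = t_k(\omega) + \frac{q(m^k \omega)}{p(m^k \omega)} \Big[ t_k(\omega) - t_{k-1}(\omega) - \frac{1}{q(m^k \omega)} \Big], \quad k \in \mathbb{Z}_+^*, \ \omega \in \Omega. \tag{8}
$$

Iterating the above Eq. (8) and since $t_0(\omega) = 0$, we may obtain that

$$
t_\ell(\omega) = t_1(\omega) + \sum_{k=1}^{\ell-1} \frac{q(m^1 \omega) \cdots q(m^k \omega)}{p(m^1 \omega) \cdots p(m^k \omega)} \Big[ t_1(\omega) - \frac{1}{q(m^1 \omega)} - \sum_{i=2}^{k} \frac{p(m^1 \omega) \cdots p(m^{i-1} \omega)}{q(m^1 \omega) \cdots q(m^i \omega)} \Big]. \tag{9}
$$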

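In order to determine the exact value of $t_1(\omega)$, $\omega \in \Omega$, we modify the circuit
chain $(X_n)_{n \in \mathbb{N}}$ such that $p(m^0 \omega) = 1$. Since $t_1(\omega) = E(T_0(\omega)) - 1$, where $T_0(\omega)$
is the first return time, for every $\omega \in \Omega$, with $E(T_0(\omega)) = \frac{1}{\pi_0(\omega)}$, it remains to
determine the exact value of $\pi_0(\omega)$. To this end, let $\pi_k(\omega)$, $k \in \mathbb{Z}_+^* \cup \{0\}$, $\omega \in \Omega$,
be the stationary distribution of the modified circuit chain $(X_n)_{n \in \mathbb{N}}$ satisfying the
following relation:

$$
\pi_k(\omega) = \pi_{k-1}(\omega) \cdot p(m^{k-1} \omega) + \pi_{k+1}(\omega) \cdot q(m^{k+1} \omega) + [1 - p(m^k \omega) - q(m^k \omega)] \cdot \pi_k(\omega),
$$

$k \in \mathbb{Z}_+^*$, with $\pi_0(\omega) = \pi_1(\omega) \cdot q(m^1 \omega)$, $\omega \in \Omega$.

By rearranging the above equation, we obtain

$$
\pi_{k+1}(\omega) = \frac{p(m^k \omega)}{q(m^{k+1} \omega)} \cdot \pi_k(\omega), \quad k \in \mathbb{Z}_+^* \cup \{0\}, \ \omega \in \Omega,
$$

or equivalently

$$
\pi_k(\omega) = \frac{p(m^1 \omega) \cdots p(m^{k-1} \omega)}{q(m^1 \omega) \cdots q(m^k \omega)} \cdot \pi_0(\omega), \quad k \in \mathbb{Z}_+^*, \ \omega \in \Omega,
$$

where the empty product in the numerator for $k = 1$ equals 1.
Taking into account that $\sum_{k=0}^{+\infty} \pi_k(\omega) = 1$, we have that

$$
\pi_0(\omega) = \Big[ 1 + \sum_{k=1}^{+\infty} \frac{p(m^1 \omega) \cdots p(m^{k-1} \omega)}{q(m^1 \omega) \cdots q(m^k \omega)} \Big]^{-1}
$$

if and only if $\sum_{k=1}^{+\infty} \frac{p(m^1 \omega) \cdots p(m^{k-1} \omega)}{q(m^1 \omega) \cdots q(m^k \omega)} < +\infty$, $\omega \in \Omega$. Hence, we have that

$$
t_1(\omega) = E(T_0(\omega)) - 1 = \frac{1}{\pi_0(\omega)} - 1 = \frac{1}{q(m^1 \omega)} + \sum_{k=2}^{+\infty} \frac{p(m^1 \omega) \cdots p(m^{k-1} \omega)}{q(m^1 \omega) \cdots q(m^k \omega)}.
$$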
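
As a sanity check of the series for $t_1(\omega)$, the following Python sketch computes a truncated stationary distribution of the chain reflected at 0 and the resulting value $1/\pi_0 - 1$. It is a sketch under stated assumptions: the function name `reflected_stationary`, the truncation level `terms` and the constant environment $p \equiv 0.3$, $q \equiv 0.5$ are all hypothetical choices made here. In that homogeneous case the series is geometric and sums to $1/(q - p)$.

```python
def reflected_stationary(p, q, terms):
    """Truncated stationary distribution of the chain reflected at 0 (p_0 = 1).

    p(k) and q(k) return the probabilities of a step k -> k+1 and k -> k-1 at
    site k >= 1; the ratios pi_k / pi_0 are p_1...p_{k-1} / (q_1...q_k).
    """
    ratios = [1.0]   # pi_0 / pi_0
    prod = 1.0       # p_1...p_{k-1} / (q_1...q_{k-1}), empty product for k = 1
    for k in range(1, terms + 1):
        ratios.append(prod / q(k))
        prod *= p(k) / q(k)
    total = sum(ratios)
    return [x / total for x in ratios]


# Hypothetical constant environment with a drift towards 0.
pi = reflected_stationary(lambda k: 0.3, lambda k: 0.5, 400)
t1 = 1.0 / pi[0] - 1.0
print(pi[0], t1, 1.0 / (0.5 - 0.3))
```

The truncation is harmless here because the terms decay geometrically; for an environment in which the series diverges, the partial sums grow without bound and no stationary distribution exists, in line with the summability condition above.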


By substituting ft; (@) in relation (9), we may obtain 


i-l 


qim'w)...qnkw) Spina)... p(m'—'e) 
vio) =nion+ | am To). noo Taleo | 


(10) 
£=2,3,..., mE. 


Since 


_ pim'o)...p(m*o) _ we) : 
bi (@) - bo(w)... by (@) = es es = nto? k Zi, we, 
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from relation (10), we may obtain that 


+00 . 
onion yo > wie, £=2,3,..., wEQ. 
=k+ 


, Pima) 
(11) 
Similarly, since 
p(m°a) r(m‘o) w,(w) 
Vi(@)Y2(@) ..- Yke(@) = a Ta lb1(@)- b2(@)... bx (@)] = =—, 
r(m’ow) p(m*o) W9(w) 
keZ, w€ 2, we may also obtain that 
£-1 a k = k +00 d 
te=ft (a) + > : pie a qim a i Es c ba a) i 
a ponka) wo) 44, = pina) — gino) 
(12) 
or equivalently 
pl — panko) —qimko)  wei(o) w, (o) 
te(@) = ti(@) + [ k re) , 7 i \ 
a we(@) + g(m*w) we) ey 1 — p(m'a) — q(m'o) 
£=2,3,..., WE. 


(13) 


Relations (11), (12) and (13) are suitable expressions of the expected time to
extinction $t_\ell(\omega)$, for every $\ell\in\mathbb{Z}_+^*\setminus\{1\}$, for the circuit chain $(X_n)_{n\in\mathbb{N}}$ through its
unique representation by the sequences of directed circuits $(c_k)$ and $(c'_k)$ and weights
$(w_k(\omega))$ and $(w'_k(\omega))$, $k\in\mathbb{Z}$, respectively, for every random environment $\omega\in\Omega$.
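
The route to $t_1(\omega)$ used above (stationary distribution of the modified chain, then $t_1 = 1/\pi_0 - 1$) can be checked numerically. The following is only a minimal sketch under stated assumptions: the environment is fixed, the state space is truncated at a hypothetical level `K`, and the arrays `p` and `q` are illustrative stand-ins for $p(m^k\omega)$ and $q(m^k\omega)$ (none of these names come from the chapter).

```python
import numpy as np

def extinction_time_from_one(p, q):
    """Return t_1 = 1/pi_0 - 1 for a chain truncated to the states 0..K.

    p[k], q[k] stand for p(m^k w), q(m^k w), k = 1..K (index 0 is unused).
    Assumption: state 0 of the modified chain jumps to state 1 with probability 1.
    """
    K = len(p) - 1
    # ratios[k] = pi_k / pi_0, built from pi_1 = pi_0 / q_1 and
    # pi_{k+1} = p_k / q_{k+1} * pi_k (the rearranged stationarity relation).
    ratios = np.empty(K + 1)
    ratios[0] = 1.0
    ratios[1] = 1.0 / q[1]
    for k in range(1, K):
        ratios[k + 1] = ratios[k] * p[k] / q[k + 1]
    pi0 = 1.0 / ratios.sum()          # normalisation: the pi_k sum to 1
    return 1.0 / pi0 - 1.0            # t_1 = E(T_0) - 1

# Hypothetical homogeneous environment: p = 0.3 (step up), q = 0.5 (step down).
K = 200
p = np.full(K + 1, 0.3)
q = np.full(K + 1, 0.5)
t1 = extinction_time_from_one(p, q)
print(t1)
```

In this homogeneous example the series for $t_1$ is geometric, so the result can be compared with the closed form $1/(q-p)$; the truncation level only matters through terms of order $(p/q)^K$.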


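3.1.2 $k\in\mathbb{Z}_-^*$


Following an analogous way of that given in Sect. 3.1.1, for $k\in\mathbb{Z}_-^*$, we have
similarly that

$$t_{k-1}(\omega)=t_k(\omega)+\frac{p(m^k\omega)}{q(m^k\omega)}\left[t_k(\omega)-t_{k+1}(\omega)-\frac{1}{p(m^k\omega)}\right],\quad k\in\mathbb{Z}_-^*,\ \omega\in\Omega. \tag{14}$$

Iterating the above Eq. (14) and since $t_0(\omega)=0$, we may take that

$$t_\ell(\omega)=t_{-1}(\omega)+\sum_{k=-1}^{\ell+1}\left[\frac{p(m^{-1}\omega)\cdots p(m^k\omega)}{q(m^{-1}\omega)\cdots q(m^k\omega)}\left(t_{-1}(\omega)-\frac{1}{p(m^{-1}\omega)}-\sum_{i=-2}^{k}\frac{q(m^{-1}\omega)\cdots q(m^{i+1}\omega)}{p(m^{-1}\omega)\cdots p(m^i\omega)}\right)\right],\quad \ell=\ldots,-3,-2,\ \omega\in\Omega. \tag{15}$$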


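Let $\pi_k(\omega)$, $\omega\in\Omega$, $k\in\mathbb{Z}_-$, be the stationary distribution of the modified
nonhomogeneous random walk with jumps ($q(m^0\omega)=1$) in random environments
$(X_n)_{n\in\mathbb{N}}$ satisfying the following relation:

$$\pi_k(\omega)=\pi_{k-1}(\omega)\,p(m^{k-1}\omega)+\pi_{k+1}(\omega)\,q(m^{k+1}\omega)+\left[1-p(m^k\omega)-q(m^k\omega)\right]\pi_k(\omega),$$

$k\in\mathbb{Z}_-^*$, with $\pi_0(\omega)=\pi_{-1}(\omega)\,p(m^{-1}\omega)$, $\omega\in\Omega$.

By rearranging the above equation, we may take that

$$\pi_k(\omega)=\frac{q(m^{k+1}\omega)}{p(m^k\omega)}\,\pi_{k+1}(\omega),\quad k\in\mathbb{Z}_-^*,\ \omega\in\Omega,$$

or equivalently

$$\pi_k(\omega)=\frac{q(m^{k+1}\omega)\,q(m^{k+2}\omega)\cdots q(m^{-1}\omega)}{p(m^k\omega)\,p(m^{k+1}\omega)\cdots p(m^{-1}\omega)}\,\pi_0(\omega),\quad k\in\mathbb{Z}_-^*,\ \omega\in\Omega.$$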


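Since $\sum_{k=0}^{-\infty}\pi_k(\omega)=1$, we may have that

$$\pi_0(\omega)=\left[1+\sum_{k=-1}^{-\infty}\frac{q(m^{-1}\omega)\cdots q(m^{k+1}\omega)}{p(m^{-1}\omega)\cdots p(m^k\omega)}\right]^{-1}$$

if and only if $\sum_{k=-1}^{-\infty}\frac{q(m^{-1}\omega)\cdots q(m^{k+1}\omega)}{p(m^{-1}\omega)\cdots p(m^k\omega)} < +\infty$, $\omega\in\Omega$. So we may obtain that

$$t_{-1}(\omega)=E({}^{*}T_0(\omega))-1=\frac{1}{\pi_0(\omega)}-1=\frac{1}{p(m^{-1}\omega)}+\sum_{k=-2}^{-\infty}\frac{q(m^{-1}\omega)\cdots q(m^{k+1}\omega)}{p(m^{-1}\omega)\cdots p(m^k\omega)},$$

where ${}^{*}T_0(\omega)$ is the first return time in the case $k\in\mathbb{Z}_-^*$ with $E({}^{*}T_0(\omega))=\frac{1}{\pi_0(\omega)}$,
for every $\omega\in\Omega$.

By substituting $t_{-1}(\omega)$ in relation (15), we may have that

$$t_\ell(\omega)=t_{-1}(\omega)+\sum_{k=-1}^{\ell+1}\left[\frac{p(m^{-1}\omega)\cdots p(m^k\omega)}{q(m^{-1}\omega)\cdots q(m^k\omega)}\sum_{i=k-1}^{-\infty}\frac{q(m^{-1}\omega)\cdots q(m^{i+1}\omega)}{p(m^{-1}\omega)\cdots p(m^i\omega)}\right],\quad \ell=\ldots,-3,-2,\ \omega\in\Omega. \tag{16}$$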

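Furthermore, since

$$b_0(\omega)\,b_{-1}(\omega)\cdots b_{k+1}(\omega)=\frac{p(m^0\omega)\cdots p(m^{k+1}\omega)}{q(m^0\omega)\cdots q(m^{k+1}\omega)}=\frac{w_0(\omega)}{w_k(\omega)},$$

we have that

$$t_\ell(\omega)=t_{-1}(\omega)+\sum_{k=-1}^{\ell+1}\left[\frac{1}{w_{k-1}(\omega)}\sum_{i=k-1}^{-\infty}\frac{w_{i-1}(\omega)}{q(m^i\omega)}\right],\quad \ell=\ldots,-3,-2,\ \omega\in\Omega. \tag{17}$$

Similarly, since

$$\gamma_0(\omega)\,\gamma_{-1}(\omega)\cdots\gamma_{k+1}(\omega)=\frac{r(m^0\omega)}{r(m^k\omega)}\cdot\frac{p(m^{-1}\omega)\cdots p(m^{k+1}\omega)}{q(m^{-1}\omega)\cdots q(m^{k+1}\omega)}\cdot\frac{p(m^k\omega)}{q(m^0\omega)}=\frac{w'_0(\omega)}{w'_k(\omega)},$$

we may also have that

$$t_\ell(\omega)=t_{-1}(\omega)+\sum_{k=-1}^{\ell+1}\left[\frac{1-p(m^k\omega)-q(m^k\omega)}{q(m^k\omega)}\cdot\frac{1}{w'_k(\omega)}\sum_{i=k-1}^{-\infty}\frac{w'_i(\omega)}{1-p(m^i\omega)-q(m^i\omega)}\right] \tag{18}$$

or equivalently

$$t_\ell(\omega)=t_{-1}(\omega)+\sum_{k=-1}^{\ell+1}\left[\frac{1-p(m^k\omega)-q(m^k\omega)}{p(m^k\omega)\,w_{k-1}(\omega)}\cdot\frac{w_k(\omega)}{w'_k(\omega)}\sum_{i=k-1}^{-\infty}\frac{w'_i(\omega)}{1-p(m^i\omega)-q(m^i\omega)}\right],\quad \ell=\ldots,-3,-2,\ \omega\in\Omega. \tag{19}$$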


Relations (17), (18) and (19) are suitable expressions of the mean time to extinction
$t_\ell(\omega)$, for every $\ell\in\mathbb{Z}_-^*\setminus\{-1\}$, for the circuit chain $(X_n)_{n\in\mathbb{N}}$ through its unique
representation by the directed circuits $(c_k)$ and $(c'_k)$ and weights $(w_k(\omega))$ and $(w'_k(\omega))$,
$k\in\mathbb{Z}_-$, respectively, for every random environment $\omega\in\Omega$.


3.2 For the Circuit Chain $(X'_n)_{n\in\mathbb{N}}$


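Similarly, we distinguish the following cases regarding $k\in\mathbb{Z}$.


3.2.1 $k\in\mathbb{Z}_+^*$


By using an analogous way of that given in Sect. 3.1 for the circuit chain $(X_n)_{n\in\mathbb{N}}$,
let us consider that the state 0 is a recurrent absorbing state, that is, $q(m^0\omega)=0$ and
$p(m^0\omega)=0$, $\omega\in\Omega$. This means that the system will be extinct at some point.
Let also $t'_k(\omega)$, $\omega\in\Omega$, be the expected time before the system enters the state
0, given that the initial state is the state $k$, $k\in\mathbb{Z}_+^*$, $\omega\in\Omega$. We have that
$t'_0(\omega)=0$, $p(m^0\omega)=0$, $q(m^0\omega)=0$, $p(m^k\omega)+r(m^k\omega)+q(m^k\omega)=1$, $k\in\mathbb{Z}$,
$\omega\in\Omega$. Then, we may have that

$$t'_{k+1}(\omega)=t'_k(\omega)+\frac{p(m^k\omega)}{q(m^k\omega)}\left[t'_k(\omega)-t'_{k-1}(\omega)-\frac{1}{p(m^k\omega)}\right],\quad k\in\mathbb{Z}_+^*,\ \omega\in\Omega. \tag{20}$$

By iterating the above Eq. (20) and since $t'_0(\omega)=0$, we may take that

$$t'_\ell(\omega)=t'_1(\omega)+\sum_{k=1}^{\ell-1}\left[\frac{p(m^1\omega)\cdots p(m^k\omega)}{q(m^1\omega)\cdots q(m^k\omega)}\left(t'_1(\omega)-\frac{1}{p(m^1\omega)}-\sum_{i=2}^{k}\frac{q(m^1\omega)\cdots q(m^{i-1}\omega)}{p(m^1\omega)\cdots p(m^i\omega)}\right)\right],\quad \ell=2,3,\ldots,\ \omega\in\Omega. \tag{21}$$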


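In order to determine the exact value of $t'_1(\omega)$, $\omega\in\Omega$, we modify the circuit chain
$(X'_n)_{n\in\mathbb{N}}$ such that $q(m^0\omega)=1$. Since $t'_1(\omega)=E(T'_0(\omega))-1$, where $T'_0(\omega)$ is the
first return time, for every $\omega\in\Omega$, with $E(T'_0(\omega))=\frac{1}{\pi'_0(\omega)}$, it remains to determine
the exact value of $\pi'_0(\omega)$. To this direction, let $\pi'_k(\omega)$, $k\in\mathbb{Z}_+^*$, $\omega\in\Omega$, be the
stationary distribution of the modified circuit chain $(X'_n)_{n\in\mathbb{N}}$ satisfying the following
relation:

$$\pi'_k(\omega)=\pi'_{k-1}(\omega)\,q(m^{k-1}\omega)+\pi'_{k+1}(\omega)\,p(m^{k+1}\omega)+\left[1-p(m^k\omega)-q(m^k\omega)\right]\pi'_k(\omega),$$

$k\in\mathbb{Z}_+^*$, with $\pi'_0(\omega)=\pi'_1(\omega)\,p(m^1\omega)$, $\omega\in\Omega$.
By rearranging the above equation, we get that

$$\pi'_{k+1}(\omega)=\frac{q(m^k\omega)}{p(m^{k+1}\omega)}\,\pi'_k(\omega),\quad k\in\mathbb{Z}_+^*,\ \omega\in\Omega,$$

or equivalently

$$\pi'_k(\omega)=\frac{q(m^1\omega)\cdots q(m^{k-1}\omega)}{p(m^1\omega)\cdots p(m^k\omega)}\,\pi'_0(\omega),\quad k\in\mathbb{Z}_+^*,\ \omega\in\Omega.$$

Since $\sum_{k=0}^{+\infty}\pi'_k(\omega)=1$, we have that

$$\pi'_0(\omega)=\left[1+\sum_{k=1}^{+\infty}\frac{q(m^1\omega)\cdots q(m^{k-1}\omega)}{p(m^1\omega)\cdots p(m^k\omega)}\right]^{-1}$$

if and only if $\sum_{k=1}^{+\infty}\frac{q(m^1\omega)\cdots q(m^{k-1}\omega)}{p(m^1\omega)\cdots p(m^k\omega)} < +\infty$, $\omega\in\Omega$.
Hence, we have that

$$t'_1(\omega)=E(T'_0(\omega))-1=\frac{1}{\pi'_0(\omega)}-1=\frac{1}{p(m^1\omega)}+\sum_{k=2}^{+\infty}\frac{q(m^1\omega)\cdots q(m^{k-1}\omega)}{p(m^1\omega)\cdots p(m^k\omega)}.$$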


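Finally, by substituting $t'_1(\omega)$ in relation (21), we may obtain that

$$t'_\ell(\omega)=t'_1(\omega)+\sum_{k=1}^{\ell-1}\left[\frac{p(m^1\omega)\cdots p(m^k\omega)}{q(m^1\omega)\cdots q(m^k\omega)}\sum_{i=k+1}^{+\infty}\frac{q(m^1\omega)\cdots q(m^{i-1}\omega)}{p(m^1\omega)\cdots p(m^i\omega)}\right],\quad \ell=2,3,\ldots,\ \omega\in\Omega. \tag{22}$$

Furthermore, since

$$\ell_1(\omega)\,\ell_2(\omega)\cdots\ell_k(\omega)=\frac{p(m^1\omega)\cdots p(m^k\omega)}{q(m^1\omega)\cdots q(m^k\omega)}=\frac{w''_0(\omega)}{w''_k(\omega)},\quad k\in\mathbb{Z}_+^*,\ \omega\in\Omega,$$

we have that

$$t'_\ell(\omega)=t'_1(\omega)+\sum_{k=1}^{\ell-1}\left[\frac{1}{w''_k(\omega)}\sum_{i=k+1}^{+\infty}\frac{w''_i(\omega)}{q(m^i\omega)}\right],\quad \ell=2,3,\ldots,\ \omega\in\Omega. \tag{23}$$

Similarly, since

$$\nu_1(\omega)\,\nu_2(\omega)\cdots\nu_k(\omega)=\frac{r(m^0\omega)}{q(m^0\omega)}\cdot\frac{q(m^k\omega)}{r(m^k\omega)}\cdot\frac{p(m^1\omega)\cdots p(m^k\omega)}{q(m^1\omega)\cdots q(m^k\omega)}=\frac{w'''_0(\omega)}{w'''_k(\omega)},\quad k\in\mathbb{Z}_+^*,\ \omega\in\Omega,$$

we may also have that

$$t'_\ell(\omega)=t'_1(\omega)+\sum_{k=1}^{\ell-1}\left[\frac{1-p(m^k\omega)-q(m^k\omega)}{q(m^k\omega)}\cdot\frac{1}{w'''_k(\omega)}\sum_{i=k+1}^{+\infty}\frac{w'''_i(\omega)}{1-p(m^i\omega)-q(m^i\omega)}\right] \tag{24}$$

or equivalently

$$t'_\ell(\omega)=t'_1(\omega)+\sum_{k=1}^{\ell-1}\left[\frac{1-p(m^k\omega)-q(m^k\omega)}{w''_k(\omega)\,p(m^k\omega)}\cdot\frac{w''_{k-1}(\omega)}{w'''_k(\omega)}\sum_{i=k+1}^{+\infty}\frac{w'''_i(\omega)}{1-p(m^i\omega)-q(m^i\omega)}\right],\quad \ell=2,3,\ldots,\ \omega\in\Omega. \tag{25}$$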


Relations (23), (24) and (25) give suitable expressions of the mean time to
extinction $t'_\ell(\omega)$, for every $\ell\in\mathbb{Z}_+^*\setminus\{1\}$, for the circuit chain $(X'_n)_{n\in\mathbb{N}}$ through its
unique representation by the directed circuits $(c''_k)$ and $(c'''_k)$ and weights $(w''_k(\omega))$
and $(w'''_k(\omega))$, $k\in\mathbb{Z}_+$, respectively, for every random environment $\omega\in\Omega$.
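
The recursion (20) comes from a first-step analysis: from state $k$ the chain $(X'_n)$ spends one unit of time and then moves up with probability $q(m^k\omega)$, down with probability $p(m^k\omega)$, or stays put. As an independent check of the closed-form expressions, the corresponding linear system can be solved directly. The sketch below is not the chapter's method; it assumes a fixed environment, a hypothetical truncation level `K` (the upward move from `K` is replaced by staying put), and illustrative arrays `up` and `down` standing for $q(m^k\omega)$ and $p(m^k\omega)$.

```python
import numpy as np

def expected_absorption_times(up, down):
    """Solve t_k = 1 + up_k t_{k+1} + down_k t_{k-1} + stay_k t_k with t_0 = 0.

    up[k], down[k] are the one-step probabilities from state k (k = 1..K);
    index 0 is unused. Returns the array (t_0, t_1, ..., t_K).
    """
    K = len(up) - 1
    A = np.zeros((K, K))
    b = np.ones(K)
    for k in range(1, K + 1):
        row = k - 1
        if k < K:
            A[row, row] = up[k] + down[k]
            A[row, row + 1] = -up[k]
        else:
            # Truncation assumption: no upward move from the last state K.
            A[row, row] = down[k]
        if k > 1:
            A[row, row - 1] = -down[k]
        # For k = 1 the term down_1 * t_0 vanishes because t_0 = 0.
    t = np.linalg.solve(A, b)
    return np.concatenate(([0.0], t))

# Hypothetical homogeneous environment: up = q = 0.3, down = p = 0.5.
K = 200
up = np.full(K + 1, 0.3)
down = np.full(K + 1, 0.5)
times = expected_absorption_times(up, down)
print(times[1], times[2])
```

For this homogeneous example the series for $t'_1(\omega)$ is geometric with sum $1/(p-q)$, which the linear solve reproduces up to the truncation error.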


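3.2.2 $k\in\mathbb{Z}_-^*$


By using an analogous way of that given in Sect. 3.2.1, for $k\in\mathbb{Z}_-^*$, we have similarly
that

$$t'_{k-1}(\omega)=t'_k(\omega)+\frac{q(m^k\omega)}{p(m^k\omega)}\left[t'_k(\omega)-t'_{k+1}(\omega)-\frac{1}{q(m^k\omega)}\right],\quad k\in\mathbb{Z}_-^*,\ \omega\in\Omega. \tag{26}$$

Iterating the above relation (26) and since $t'_0(\omega)=0$, we may obtain that

$$t'_\ell(\omega)=t'_{-1}(\omega)+\sum_{k=-1}^{\ell+1}\left[\frac{q(m^{-1}\omega)\cdots q(m^k\omega)}{p(m^{-1}\omega)\cdots p(m^k\omega)}\left(t'_{-1}(\omega)-\frac{1}{q(m^{-1}\omega)}-\sum_{i=-2}^{k}\frac{p(m^{-1}\omega)\cdots p(m^{i+1}\omega)}{q(m^{-1}\omega)\cdots q(m^i\omega)}\right)\right],\quad \ell=\ldots,-3,-2,\ \omega\in\Omega. \tag{27}$$

Let $\pi'_k(\omega)$, $k\in\mathbb{Z}_-$, $\omega\in\Omega$, be the stationary distribution of the modified circuit
chain $(X'_n)_{n\in\mathbb{N}}$ ($p(m^0\omega)=1$) satisfying the following relation:

$$\pi'_k(\omega)=\pi'_{k-1}(\omega)\,q(m^{k-1}\omega)+\pi'_{k+1}(\omega)\,p(m^{k+1}\omega)+\left[1-p(m^k\omega)-q(m^k\omega)\right]\pi'_k(\omega),\quad k\in\mathbb{Z}_-^*,$$

with $\pi'_0(\omega)=q(m^{-1}\omega)\,\pi'_{-1}(\omega)$, $\omega\in\Omega$.
By rearranging the above equation, we may take that

$$\pi'_k(\omega)=\frac{p(m^{k+1}\omega)}{q(m^k\omega)}\,\pi'_{k+1}(\omega),\quad k\in\mathbb{Z}_-^*,\ \omega\in\Omega,$$

or equivalently

$$\pi'_k(\omega)=\frac{p(m^{k+1}\omega)\,p(m^{k+2}\omega)\cdots p(m^{-1}\omega)}{q(m^k\omega)\,q(m^{k+1}\omega)\cdots q(m^{-1}\omega)}\,\pi'_0(\omega),\quad k\in\mathbb{Z}_-^*,\ \omega\in\Omega.$$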


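Similarly, we have that

$$\pi'_0(\omega)=\left[1+\sum_{k=-1}^{-\infty}\frac{p(m^{-1}\omega)\cdots p(m^{k+1}\omega)}{q(m^{-1}\omega)\cdots q(m^k\omega)}\right]^{-1}$$

if and only if $\sum_{k=-1}^{-\infty}\frac{p(m^{-1}\omega)\cdots p(m^{k+1}\omega)}{q(m^{-1}\omega)\cdots q(m^k\omega)} < +\infty$, $\omega\in\Omega$.

So finally, we may obtain that

$$t'_{-1}(\omega)=E({}^{*}T'_0(\omega))-1=\frac{1}{\pi'_0(\omega)}-1=\frac{1}{q(m^{-1}\omega)}+\sum_{k=-2}^{-\infty}\frac{p(m^{-1}\omega)\cdots p(m^{k+1}\omega)}{q(m^{-1}\omega)\cdots q(m^k\omega)},$$

where ${}^{*}T'_0(\omega)$ is the first return time in the case $k\in\mathbb{Z}_-^*$ with $E({}^{*}T'_0(\omega))=\frac{1}{\pi'_0(\omega)}$,
for every $\omega\in\Omega$.
Finally, we get that

$$t'_\ell(\omega)=t'_{-1}(\omega)+\sum_{k=-1}^{\ell+1}\left[\frac{q(m^{-1}\omega)\cdots q(m^k\omega)}{p(m^{-1}\omega)\cdots p(m^k\omega)}\sum_{i=k-1}^{-\infty}\frac{p(m^{-1}\omega)\cdots p(m^{i+1}\omega)}{q(m^{-1}\omega)\cdots q(m^i\omega)}\right],\quad \ell=\ldots,-3,-2,\ \omega\in\Omega. \tag{28}$$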


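Furthermore, since

$$\ell_0(\omega)\,\ell_{-1}(\omega)\cdots\ell_{k+1}(\omega)=\frac{p(m^0\omega)\,p(m^{-1}\omega)\cdots p(m^{k+1}\omega)}{q(m^0\omega)\,q(m^{-1}\omega)\cdots q(m^{k+1}\omega)}=\frac{w''_k(\omega)}{w''_0(\omega)},$$

we have that

$$t'_\ell(\omega)=t'_{-1}(\omega)+\sum_{k=-1}^{\ell+1}\left[\frac{1}{w''_{k-1}(\omega)}\sum_{i=k-1}^{-\infty}\frac{w''_{i-1}(\omega)}{p(m^i\omega)}\right],\quad \ell=\ldots,-3,-2,\ \omega\in\Omega. \tag{29}$$

Similarly, since

$$\nu_0(\omega)\,\nu_{-1}(\omega)\cdots\nu_{k+1}(\omega)=\frac{q(m^0\omega)}{r(m^0\omega)}\cdot\frac{r(m^k\omega)}{q(m^k\omega)}\cdot\frac{p(m^0\omega)\,p(m^{-1}\omega)\cdots p(m^{k+1}\omega)}{q(m^0\omega)\,q(m^{-1}\omega)\cdots q(m^{k+1}\omega)}=\frac{w'''_k(\omega)}{w'''_0(\omega)},$$

we may also obtain that

$$t'_\ell(\omega)=t'_{-1}(\omega)+\sum_{k=-1}^{\ell+1}\left[\frac{1-p(m^k\omega)-q(m^k\omega)}{p(m^k\omega)}\cdot\frac{1}{w'''_k(\omega)}\sum_{i=k-1}^{-\infty}\frac{w'''_i(\omega)}{1-p(m^i\omega)-q(m^i\omega)}\right] \tag{30}$$

or equivalently

$$t'_\ell(\omega)=t'_{-1}(\omega)+\sum_{k=-1}^{\ell+1}\left[\frac{1-p(m^k\omega)-q(m^k\omega)}{q(m^k\omega)\,w''_{k-1}(\omega)}\cdot\frac{w''_k(\omega)}{w'''_k(\omega)}\sum_{i=k-1}^{-\infty}\frac{w'''_i(\omega)}{1-p(m^i\omega)-q(m^i\omega)}\right],\quad \ell=\ldots,-3,-2,\ \omega\in\Omega. \tag{31}$$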


Relations (29), (30) and (31) are suitable expressions of the expected extinction
time $t'_\ell(\omega)$, for every $\ell\in\mathbb{Z}_-^*\setminus\{-1\}$, for the circuit chain $(X'_n)_{n\in\mathbb{N}}$ through its unique
representation by the sequences of directed circuits $(c''_k)$ and $(c'''_k)$ and weights
$(w''_k(\omega))$ and $(w'''_k(\omega))$, $k\in\mathbb{Z}_-$, respectively, for every random environment $\omega\in\Omega$.


A Statistical Learning Theory Approach
for the Analysis of the Trade-off Between 
Sample Size and Precision in Truncated 
Ordinary Least Squares 


Giorgio Gnecco, Fabio Raciti, and Daniela Selvi 


Abstract This chapter deals with linear regression problems for which one has 
the possibility of varying the supervision cost per example, by controlling the 
conditional variance of the output given the feature vector. For a fixed upper bound 
on the total available supervision cost, the trade-off between the number of training 
examples and their precision of supervision is investigated, using a nonasymptotic 
data-independent bound from the literature in statistical learning theory. This bound 
is related to the truncated output of the ordinary least squares regression algorithm. 
The results of the analysis are also compared theoretically with the ones obtained 
in a previous work, based on a large-sample approximation of the untruncated 
output of ordinary least squares. Advantages and disadvantages of the investigated 
approach are discussed. 


1 Introduction 


In various problems related to economics, engineering, physics, and other fields,
one is often required to approximate an unknown regression function using a finite
set of noisy input-output training examples, where the input is typically a
finite-dimensional vector of features, and the output is provided by a suitable expert.
The goal is to learn a model of the input-output relationship, which is not limited
to explaining the data belonging to the training set, but is also able to predict the
output associated with a new test input example, not available in the training phase.
Supervised machine learning [10, Chapter 2] and, in particular, its mathematical
foundation provided by Statistical Learning Theory (SLT) [16] provide methods
and algorithms to investigate and solve this kind of problem, by formulating
and optimizing a suitable cost expressing the trade-off between minimizing the
average prediction error on the training set and guaranteeing a proper generalization
capability on input examples that were not used for training.

In some supervised learning problems of practical interest, the noise variance of 
the output can be controlled to some extent. For instance, a smaller variance could 
be obtained by increasing the cost associated with the acquisition of each supervised 
example. To achieve this goal, one could use more precise (but also more expensive) 
measurement devices. Analogously, the supervision could be provided by experts in 
the specific field, but in general, the higher the experience, the higher the cost. In all 
these cases, it is reasonable to investigate—then, to optimize—the trade-off between 
the number of labeled examples used for training and their precision of supervision, 
with respect to a suitably defined generalization error associated with the function 
approximation learned from the training set. This trade-off was recently analyzed 
and optimized in [3], which investigated a modification of the classical linear 
regression problem. In the situation analyzed in [3], given a fixed upper bound on the 
total available supervision time, one has the possibility of varying the time (hence, 
the cost) dedicated to the supervision of each training example, thus controlling 
the conditional variance of the output given the feature vector. By combining 
the estimates of the model parameters provided by the Ordinary Least Squares 
(OLS) regression algorithm with a suitable large-sample approximation, it was 
shown therein that the optimal choice of the supervision time per example strongly 
depends on the assumptions made on the noise. For instance, if the precision of
each supervision (defined as the reciprocal of the variance of the measured output)
has “decreasing returns to scale” with respect to the associated supervision time
per example (i.e., it is a strictly concave and increasing function of such time),
then the large-sample approximation of the generalization error is minimized by 
choosing the smallest possible supervision time per example (assuming that this 
belongs to a closed and bounded interval for which the large-sample approximation 
holds). This corresponds to maximizing the size of the training set. In [4], the 
analysis made in [3] was refined and extended by considering also the case in which 
distinct training examples are associated with one of two different supervision times 
per example, and one has the possibility to optimize the percentage of examples 
assigned to one of these two times, taking also into account an upper bound on 
the total supervision time. For this situation, it was shown therein that OLS and
an alternative algorithm, Weighted Least Squares (WLS), which in principle is
more appropriate for this specific learning problem,¹ produce the same results at
optimality.

In this chapter, the trade-off between the number of supervised examples and 
their labeling precision is also explored, but using a different approach, based on 
SLT. The aim is to deal with two drawbacks still present in the analyses performed
in [3, 4]: they hold only asymptotically, and they focus on the optimization
of the generalization error conditioned on the training input data matrix, whereas
typically, in supervised machine learning, its average with respect to the random
generation of the whole training set is considered. For these reasons, in the following,
we explore the possibility of applying a bound from SLT (provided in [9]) to
optimize the trade-off between sample size and precision of supervision in linear
regression. Unlike in [3], both a finite sample size and the unconditional
generalization error are considered. However, this is obtained at the cost of replacing
the OLS algorithm with its truncated version from [9], which is needed to apply that
SLT bound. The chapter is structured as follows. Section 2 introduces the model
under investigation. Section 3 analyzes and optimizes the trade-off between number 
of examples and precision, by exploiting an SLT bound. Finally, Sect. 4 discusses 
the obtained results, together with possible extensions. 


2 Model 


The following linear model for a static input-output relationship is considered:

$$y=\beta' x, \tag{1}$$

where $x\in\mathbb{R}^{p\times 1}$ is a random feature vector, $\beta\in\mathbb{R}^{p\times 1}$ a parameter vector, and
$y\in\mathbb{R}$ the dependent variable. Assume that, for each $x\in\mathbb{R}^{p\times 1}$, one can measure, at
a positive supervision cost per example $c\in[c_{\min},c_{\max}]$, only an approximation
of $y$, corrupted by an additive supervision noise, which is modeled as a random
variable $\varepsilon_c$, independent from $x$, with mean 0 and variance

$$\sigma_c^2=k\,c^{-\alpha}, \tag{2}$$

with $k,\alpha>0$ (see [3] for some details on the interpretation of this noise model,
where $c$ is replaced by a supervision time per example $\Delta T$). In order to approximate
the parameter vector $\beta$ at a total available supervision cost smaller than or equal to
$C\ge c_{\max}$ via an estimate $\hat\beta$ and predict the uncorrupted output

$$y^{\rm test}=\beta' x^{\rm test} \tag{3}$$

for a new test example $x^{\rm test}$, only a noisy training set is available, made of a finite
number

$$N_c=\left\lfloor\frac{C}{c}\right\rfloor \tag{4}$$

of independent and identically distributed supervised examples $(x_n,\tilde y_n)$
($n=1,\ldots,N_c$), where $x_n$ is distributed as $x$, and $\tilde y_n$ is an approximation of

$$y_n=\beta' x_n, \tag{5}$$

modeled as

$$\tilde y_n=y_n+\varepsilon_{n,c}, \tag{6}$$

the $\varepsilon_{n,c}$ being independent and identically distributed as $\varepsilon_c$. For each $c$, the estimate
of the parameter vector $\beta$ is provided by the classical OLS algorithm (see, e.g.,
[7]), and is denoted by $\hat\beta_c$. In the following, its form is recalled to the reader. Let
$X_{N_c}\in\mathbb{R}^{N_c\times p}$ denote the training input data matrix, whose generic $n$-th row is $x'_n$,
and $\tilde y_{N_c}\in\mathbb{R}^{N_c\times 1}$ the vector whose generic $n$-th element is $\tilde y_n$. In general, the OLS
estimate of the parameter vector $\beta$ has the form

$$\hat\beta_c=X_{N_c}^{+}\,\tilde y_{N_c}, \tag{7}$$

where $X_{N_c}^{+}$ is the Moore-Penrose pseudoinverse of $X_{N_c}$. In the typical case for
which the matrix $X'_{N_c}X_{N_c}$ is invertible, one has

$$X_{N_c}^{+}=\left(X'_{N_c}X_{N_c}\right)^{-1}X'_{N_c}, \tag{8}$$

and the OLS estimate can be expressed as

$$\hat\beta_c=\left(X'_{N_c}X_{N_c}\right)^{-1}X'_{N_c}\,\tilde y_{N_c}. \tag{9}$$

¹ For the two respective problems considered in [3] and in the extended framework of [4], OLS and
WLS provide the best linear unbiased estimates of the parameter vector of the linear regression
model, according to the Gauss-Markov theorem [13, Section 9.4]. This depends on the fact that the
measurement noise is homoskedastic in the framework considered in [3], and heteroskedastic in its
extension considered in [4].

It is also assumed here that, for some given $L>0$, one has

$$|\beta' x|\le L \tag{10}$$

on the support of the probability distribution of $x$. To take into account this prior
knowledge, for each $c$, the OLS estimate $\hat\beta'_c x^{\rm test}$ of $y^{\rm test}$ can be replaced by the
following truncated version:

$$\hat y_c^{\rm test}=T_L\left(\hat\beta'_c x^{\rm test}\right), \tag{11}$$

where $T_L$ denotes the projection operator from $\mathbb{R}$ onto the interval $[-L,L]$. In the
following, $\hat y_c^{\rm test}$ is referred to as the truncated OLS estimate of the output associated
with a test input.
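
The model of this section can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code: the noise is taken to be Gaussian (the text only fixes its mean and variance), the features are drawn uniformly from $[-1,1]^p$, and all numerical values (`C`, `c`, `k`, `alpha`, `L`, `p`) are hypothetical choices made only for illustration.

```python
import numpy as np

def truncated_ols_predict(X_train, y_noisy, X_test, L):
    """OLS via the Moore-Penrose pseudoinverse (7), then truncation to [-L, L] (11)."""
    beta_hat = np.linalg.pinv(X_train) @ y_noisy
    y_hat = np.clip(X_test @ beta_hat, -L, L)
    return y_hat, beta_hat

rng = np.random.default_rng(0)
p, L = 5, 1.0                              # hypothetical dimension and bound in (10)
C, c, k, alpha = 200.0, 2.0, 0.5, 0.5      # hypothetical budget, cost, noise parameters

N_c = int(np.floor(C / c))                 # number of examples, Eq. (4)
sigma_c = np.sqrt(k * c ** (-alpha))       # noise standard deviation, Eq. (2)

beta = rng.uniform(-1.0, 1.0, size=p)
beta *= L / np.abs(beta).sum()             # guarantees |beta'x| <= L when |x_j| <= 1

X = rng.uniform(-1.0, 1.0, size=(N_c, p))
y_noisy = X @ beta + sigma_c * rng.normal(size=N_c)   # Eq. (6), Gaussian noise assumed

X_test = rng.uniform(-1.0, 1.0, size=(1000, p))
y_test = X_test @ beta                                 # uncorrupted outputs, Eq. (3)
y_hat, beta_hat = truncated_ols_predict(X, y_noisy, X_test, L)

print("untruncated test MSE:", np.mean((X_test @ beta_hat - y_test) ** 2))
print("truncated test MSE:  ", np.mean((y_hat - y_test) ** 2))
```

Since $y^{\rm test}$ lies in $[-L,L]$ by (10), projecting the prediction onto that interval can never move it further from $y^{\rm test}$, so the truncated error is at most the untruncated one for every test point.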


3 Optimal Trade-off Between Sample Size and Precision 
of Supervision via Statistical Learning Theory 


In the following analysis, the generalization error (or expected risk) associated with
the square loss

$$R_c=E\left\{\left(\hat y_c^{\rm test}-y^{\rm test}\right)^2\right\} \tag{12}$$

is considered, where the feature vector $x^{\rm test}$ generating $y^{\rm test}$ is independent from
the training examples and has the same probability distribution as $x$, and the
expectation above is with respect to both the test example and the training set.

In many situations investigated by SLT, the generalization error (12) cannot be
computed exactly, hence an upper bound on it is minimized. So, in the following,
the interest is in applying SLT to compare, for different values of $c$, the resulting
upper bound on the generalization error (12). According to [9, Theorem 11.3],² the
following upper bound on the generalization error (12) holds for each $c$:

$$R_c\le\frac{2304L^2}{N_c}\left(1+\ln\left(9\,(12eN_c)^{2(p+1)}\right)\right)+8\sigma_c^2\,\frac{p}{N_c}. \tag{19}$$

By inserting in (19) the expressions (2) and (4) for $\sigma_c^2$ and $N_c$, respectively, one gets

$$R_c\le\frac{2304L^2}{\lfloor C/c\rfloor}\left(1+\ln\left(9\left(12e\left\lfloor\frac{C}{c}\right\rfloor\right)^{2(p+1)}\right)\right)+8kc^{-\alpha}\,\frac{p}{\lfloor C/c\rfloor}. \tag{20}$$

In the following, the interest is in minimizing the upper bound (20) on the
generalization error with respect to $c\in[c_{\min},c_{\max}]$. The focus here is on the
case $0 < \alpha < 1$, which, as in [3], models “decreasing returns of scale” of the
precision $1/\sigma_c^2$ of the supervision per example with respect to its cost $c$ (i.e., if one
doubles $c$, then, according to (2), $1/\sigma_c^2$ increases less than two times its initial value).
Indeed, a simpler analysis can be done in that situation than in the general case. By
elementary calculus, one can show that the first term on the right-hand side of (20),

$$f(c)=\frac{2304L^2}{\lfloor C/c\rfloor}\left(1+\ln\left(9\left(12e\left\lfloor\frac{C}{c}\right\rfloor\right)^{2(p+1)}\right)\right), \tag{21}$$

interpreted as a function of $c$, is continuous from the left with positive jump
discontinuities, and piecewise constant between consecutive jumps (see Fig. 1a).
Similarly, one can prove that the second term on the right-hand side of (20),

$$g(c)=8kc^{-\alpha}\,\frac{p}{\lfloor C/c\rfloor}, \tag{22}$$

interpreted as a function of $c$, is continuous from the left with positive jump
discontinuities, and piecewise decreasing between consecutive jumps (see Fig. 1b).

² The statement of [9, Theorem 11.3] actually includes a universal positive constant, the value of
which is not explicitly reported therein. However, it can be computed by an inspection of the proof
of such theorem, as summarized in the following. In more detail, such proof shows that

$$R_c\le E\{T_{1,N_c}\}+E\{T_{2,N_c}\} \tag{13}$$

(see [9, Section 7.1] for the precise definitions of the random variables $T_{1,n}$ and $T_{2,n}$), and that

$$E\{T_{1,N_c}\}\le v+9\,(12eN_c)^{2(p+1)}\,\frac{2304L^2}{N_c}\,\exp\left(-\frac{N_c\,v}{2304L^2}\right) \tag{14}$$

for

$$v=\frac{2304L^2}{N_c}\,\ln\left(9\,(12eN_c)^{2(p+1)}\right), \tag{15}$$

whereas

$$E\{T_{2,N_c}\}\le 8\sigma_c^2\,\frac{p}{N_c}+8\inf_{\tilde\beta\in\mathbb{R}^{p\times 1}}E\left\{\left(\tilde\beta' x^{\rm test}-y^{\rm test}\right)^2\right\}. \tag{16}$$

In the specific framework considered in this chapter, one has

$$y^{\rm test}=\beta' x^{\rm test}, \tag{17}$$

hence the upper bound (16) is reduced to

$$E\{T_{2,N_c}\}\le 8\sigma_c^2\,\frac{p}{N_c}. \tag{18}$$

With $v$ chosen as in (15), the exponential factor in (14) equals $1/\left(9\,(12eN_c)^{2(p+1)}\right)$, so that (14)
becomes $E\{T_{1,N_c}\}\le v+2304L^2/N_c$. Finally, the upper bound (19) presented in the text follows
by combining (13)-(18).

Hence, their sum


$$h(c)=f(c)+g(c) \tag{23}$$

on the right side of the bound (20) (see Fig. 1c) has the same properties as $g(c)$ and
achieves a global minimum on the interval $[c_{\min},c_{\max}]$. Moreover, since for every
supervision cost per example $c$ one has

$$\frac{C}{c}-1 < \left\lfloor\frac{C}{c}\right\rfloor\le\frac{C}{c}, \tag{24}$$

one gets, respectively,

$$\frac{2304L^2}{C/c}\left(1+\ln\left(9\left(12e\,\frac{C}{c}\right)^{2(p+1)}\right)\right)\le f(c) < \frac{2304L^2}{C/c-1}\left(1+\ln\left(9\left(12e\left(\frac{C}{c}-1\right)\right)^{2(p+1)}\right)\right), \tag{25}$$

$$8kc^{-\alpha}\,\frac{p}{C/c}\le g(c) < 8kc^{-\alpha}\,\frac{p}{C/c-1}, \tag{26}$$

and

$$\frac{2304L^2}{C/c}\left(1+\ln\left(9\left(12e\,\frac{C}{c}\right)^{2(p+1)}\right)\right)+8kc^{-\alpha}\,\frac{p}{C/c}\le h(c) < \frac{2304L^2}{C/c-1}\left(1+\ln\left(9\left(12e\left(\frac{C}{c}-1\right)\right)^{2(p+1)}\right)\right)+8kc^{-\alpha}\,\frac{p}{C/c-1}. \tag{27}$$

Hence, each of the three functions $f(c)$, $g(c)$, and $h(c)$ has increasing lower and
upper “envelopes,” which are, respectively, the leftmost and rightmost functions
reported in (25), (26), and (27). Such envelopes are shown in Fig. 1a-c. Moreover,
due to its properties reported above, the function $h(c)$ has a unique global minimum
point $c^\circ$ on $[c_{\min},c_{\max}]$. This choice for $c$ minimizes the upper bound (20) on the
generalization error (although it does not necessarily minimize the generalization
error itself). If $C$ is a multiple of $c_{\min}$, then the global minimum point of $h(c)$ is
$c^\circ=c_{\min}$. Otherwise, it is the leftmost discontinuity point of $h(c)$ on $[c_{\min},c_{\max}]$
(or $c_{\max}$ itself, in the less typical case in which all the discontinuity points of $h(c)$
are outside $[c_{\min},c_{\max}]$).

Remark 1 For illustrative purposes, in Fig. 1, the values of the parameters have been
chosen in such a way as to obtain comparable ranges for the three functions $f(c)$, $g(c)$,
and $h(c)$, in order to make their properties easily recognizable. In practice, however,
it is more reasonable to consider situations for which, uniformly on $[c_{\min},c_{\max}]$, the
standard deviation of the noise $\sigma_c$ is smaller than or equal to $L$, which, according
to (10), is a uniform upper bound on the absolute value of the regression function.
In this case, a numerical study shows that the term $f(c)$ is much larger than $g(c)$,
hence the right-hand side of the bound (20) is reduced to $h(c)\simeq f(c)$. Moreover,
although the bound itself holds for every finite sample size, it is meaningful only
when its right-hand side is uniformly smaller than $L^2$ (otherwise, the trivial estimate
$\hat\beta=0$ would be associated with a generalization error smaller than or equal to the
upper bound (20)).


Remark 2 For $\alpha\ge 1$, the analysis of the upper bound (20) on the generalization
error is slightly more involved, since it requires a numerical study leading case-by-case
to conclusions that could be less generalizable, so that the uniqueness of the
global minimum point may be lost. However, when the condition $\sigma_c\le L$ holds
uniformly on $[c_{\min},c_{\max}]$ (and even under weaker conditions when $\alpha>1$),
the approximation $h(c)\simeq f(c)$ reported in Remark 1 can still be applied, and
the analysis is practically reduced to the corresponding one reported above for
$0 < \alpha < 1$.


Remark 3 Figure 1c shows that the density of the discontinuity points of $h(c)$ (i.e.,
their number divided by the length of the interval $[c_{\min},c_{\max}]$) increases when
approaching the minimum supervision cost per example $c_{\min}$. It also suggests that,
if there is a sufficiently large number of such discontinuity points in the optimization
domain, then the optimal solution is close to $c_{\min}$. This is in agreement with the
results of the analysis performed in [3], although in that case:


(a) The generalization error conditioned on the training input data matrix was
considered.

(b) No truncation was applied to the OLS estimate of the output associated with a
test input.

(c) A large-sample approximation of the conditional generalization error was
adopted.


However, when the approximation $h(c)\simeq f(c)$ can be applied, there is less
agreement between the results of the two analyses when $\alpha>1$ or $\alpha=1$. In these
cases, [3] provides, respectively, the optimal solutions $c^\circ=c_{\max}$ and $c^\circ=$ any
element of $[c_{\min},c_{\max}]$. This is likely due to the same reasons (a), (b), and (c)
above, and also to the fact that (20) provides only a (possibly loose) upper bound on
the generalization error, whereas the analysis made in [3] is based on a probability
limit.
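
The minimization of the right-hand side of (20) discussed in this section can be sketched numerically. The code below is only an illustration under stated assumptions: it uses the fact that, on each interval where $\lfloor C/c\rfloor$ is constant, $f$ is constant and $g$ is decreasing, so the minimum over that interval is attained at its right endpoint. The function names are hypothetical, and the parameter values are the ones reported in the caption of Fig. 1.

```python
import math

def f_term(N, L, p):
    """First term of (20), written as a function of N = floor(C/c)."""
    return 2304.0 * L**2 / N * (1.0 + math.log(9.0) + 2 * (p + 1) * math.log(12.0 * math.e * N))

def g_term(c, N, k, alpha, p):
    """Second term of (20), for a cost c with N = floor(C/c)."""
    return 8.0 * k * c ** (-alpha) * p / N

def minimize_bound(C, c_min, c_max, L, p, k, alpha):
    """Return (c_opt, N_opt, h_opt) minimizing h = f + g over [c_min, c_max]."""
    n_low = math.ceil(C / c_max)     # smallest N whose jump point C/N is <= c_max
    n_high = math.floor(C / c_min)   # N at c = c_min
    # Right endpoints of the pieces: the jump points C/n, clamped to the interval.
    candidates = [(min(max(C / n, c_min), c_max), n) for n in range(n_low, n_high + 1)]
    # Last piece: it ends at c_max, where N = floor(C / c_max).
    candidates.append((c_max, math.floor(C / c_max)))

    def h(candidate):
        c, n = candidate
        return f_term(n, L, p) + g_term(c, n, k, alpha, p)

    c_best, n_best = min(candidates, key=h)
    return c_best, n_best, h((c_best, n_best))

# Parameter values taken from the caption of Fig. 1.
c_opt, n_opt, h_opt = minimize_bound(C=21.0, c_min=2.0, c_max=10.0, L=0.1, p=5, k=100.0, alpha=0.5)
print(c_opt, n_opt, h_opt)
```

With these values $C$ is not a multiple of $c_{\min}$, and the sketch returns the leftmost discontinuity point $C/10 = 2.1$, in line with the characterization of $c^\circ$ given above.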


Fig. 1 Plots of: (a) the function $f(c)$ defined in (21) and its envelopes from (25);
(b) the function $g(c)$ defined in (22) and its envelopes from (26); (c) the function $h(c)$
defined in (23) and its envelopes from (27). The values of the parameters are:
$p=5$, $L=0.1$, $c_{\min}=2$, $c_{\max}=10$, $C=21$, $k=100$, and $\alpha=0.5$


4 Discussion


To the best of the authors' knowledge, the analysis and the optimization of the trade-off
between sample size and precision of supervision in machine learning problems
using an SLT approach have rarely been explored in the literature. The problem
investigated in this chapter, however, has some similarities with the optimization 
of sample survey design, where some parameters associated with the design are 
optimized in order to minimize its sampling variance (see, e.g., the classical 
Neyman allocation in stratified sampling [8], for a given size of the dataset). It 
is also similar to the optimization of the design of measurement devices (see, 
e.g., [11], which investigates, however, a framework in which linear regression 
is marginally involved, since only arithmetic averages of measurement results are 
considered therein). Other problems in statistics that can be naturally combined with 
optimization are reported, e.g., in [12]. 

Compared with other bounds from SLT (e.g., the classical one reported in [16, 
Section 5.3], which is expressed in terms of the Vapnik—Chervonenkis dimension 
for a set of bounded real-valued functions), the one provided in (20) and used in 
the analysis presented in this chapter has the advantage that it does not involve 
any approximation of the expected risk in terms of an empirical risk (combined
with a regularization term), the empirical risk being a way to express the average
performance of the learned machine over the training set. Hence, the approach
presented in this chapter makes it possible both to perform model selection (i.e., the
choice of an optimal supervision cost per example) before processing the training
data, and to consider in the analysis a finite sample size. A disadvantage of our
approach, however, is that the bound requires a sufficiently large sample size to 
be potentially useful in practice. As a possible extension, the trade-off between 
sample size and precision of supervision could also be investigated starting from 
other SLT bounds valid directly for OLS [2], without requiring any truncation of its 
output. Moreover, the SLT bound (20) could be extended to the case of truncated 
WLS. In this case, distinct training examples could be associated with one of two 
different supervision times per example, and there is the possibility to optimize 
the percentage of examples assigned to each of these two times, taking also into 
account an upper bound on the total supervision time. Such an extension looks 
feasible, since WLS is equivalent to OLS applied to a suitable linear transformation 
of the data [13, Section 18.5]. Another possible extension concerns the choice of 
Tikhonov regularization as learning algorithm. Indeed, under conditions including 
convexity and boundedness of the loss function used to measure the quality of 
each prediction, nonasymptotic data-independent SLT bounds can be derived for 
Tikhonov regularization (see, e.g., the ones stated in [14, Chapter 13, Corollaries 
13.9 and 13.11]). Such bounds, which do not require any a-posteriori evaluation 
of empirical risks, are based on the connection between regularization, stability 
of learning algorithms, and generalization capability [1, 15]. Other extensions of
the analysis to panel data settings, like the fixed effects panel data model³ with
controllable variance of the noise,⁴ would require the preliminary derivation of
SLT bounds similar to (20), but valid for the specific learning algorithms used in 
those frameworks. For the unbalanced fixed effects data model, the investigation via 
SLT of the trade-off between number of supervised examples and their precision 
of supervision would lead to higher-dimensional optimization problems than the 
one investigated in this chapter, since such a model is characterized by several 
optimizable parameters [6]. 


Acknowledgments G. Gnecco and F. Raciti are members of the Gruppo Nazionale per l'Analisi
Matematica, la Probabilità e le loro Applicazioni (GNAMPA) of the Istituto Nazionale di Alta
Matematica (INdAM). The work was partially supported by the 2020 Italian project “Trade-off
between Number of Examples and Precision in Variations of the Fixed-Effects Panel Data Model”,
funded by INdAM-GNAMPA.


References 


1. O. Bousquet, A. Elisseeff, Stability and generalization. J. Mach. Learn. Res. 2, 499-526 (2002) 

2. O. Catoni, I. Giulini, Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear 
least squares regression, technical report (2017). https://arxiv.org/abs/1712.02747 

3. G. Gnecco, F. Nutarelli, On the trade-off between number of examples and precision of 
supervision in regression problems, in Proceedings of the Fourth International Conference 
of the International Neural Network Society on Big Data and Deep Learning (INNS BDDL 
2019), Sestri Levante, Italy, 1-6 (2019) 

4. G. Gnecco, F. Nutarelli, On the trade-off between number of examples and precision of 
supervision in machine learning problems. Optim. Lett. https://doi.org/10.1007/s11590-019- 
01486-x (2019) 

5. G. Gnecco, F. Nutarelli, Optimal trade-off between sample size and precision of supervision 
for the fixed effects panel data model, in Proceedings of the Fifth International Conference on 
machine Learning, Optimization & Data science (LOD 2019), Certosa di Pontignano (Siena), 
Italy, vol. 11943 of Lecture Notes in Computer Science (2020), pp. 531-542 

6. G. Gnecco, F. Nutarelli, D. Selvi, Optimal trade-off between sample size, precision of 
supervision, and selection probabilities for the unbalanced fixed effects panel data model. Soft 
Comput. (2020). https://doi.org/10.1007/s00500-020-05317-5 

7. J.F. GroB, Linear Regression (Springer, Berlin, 2003) 

8. R.M. Groves, F.J. Fowler, Jr., M.P. Couper, J.M. Lepkowski, E. Singer, R. Tourangeau, Survey 
Methodology (Wiley-Interscience, 2004) 

9. L. Gyorfi, A. Krzyzak, M. Kohler, H. Walk, A Distribution-Free Theory of Nonparametric 
Regression (Springer, Berlin, 2002) 


3 This is a linear regression model able to represent unobserved heterogeneity in the data via 
possibly different constants associated with distinct observational units. Depending on the setting, 
such units may have the same or different numbers of associated observations. The first case is 
called a balanced panel, the second one an unbalanced panel [17]. 

4 This case was investigated in [5] and [6], respectively for balanced and unbalanced panels, relying 
in both analyses on large-sample approximations of the outputs of suitable algorithms used to 
estimate the parameters of the fixed effects panel data model. 




10. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2nd edn. (Springer, 
Berlin, 2009) 

11. H.T. Nguyen, O. Kosheleva, V. Kreinovich, S. Ferson, Trade-off between sample size and
accuracy: case of measurements under interval uncertainty. Int. J. Approx. Reason. 50,
1164–1176 (2009)

12. J.S. Rustagi, Optimization Techniques in Statistics (Academic Press, London, 1994) 

13. P.A. Ruud, An Introduction to Classical Econometric Theory (Oxford University Press, Oxford,
2000) 

14. S. Shalev-Shwartz, S. Ben-David, Understanding Machine Learning: From Theory to
Algorithms (Cambridge University Press, Cambridge, 2014)

15. S. Shalev-Shwartz, O. Shamir, N. Srebro, K. Sridharan, Learnability, stability and uniform 
convergence. J. Mach. Learn. Res. 11, 2635-2670 (2010) 

16. V.N. Vapnik, Statistical Learning Theory (Wiley-Interscience, 1998) 

17. J.M. Wooldridge, Econometric Analysis of Cross Section and Panel Data (MIT Press, 
Cambridge, 2002) 


Recent Theoretical Advances in
Decentralized Distributed Convex
Optimization


Eduard Gorbunov, Alexander Rogozin, Aleksandr Beznosikov, 
Darina Dvinskikh, and Alexander Gasnikov 


Abstract In the last few years, the theory of decentralized distributed convex
optimization has made significant progress. Lower bounds on communication
rounds and oracle calls have appeared, as well as methods that reach both of these
bounds. In this paper, we focus on how these results can be explained based on
optimal algorithms for the non-distributed setup. In particular, we present our recent
results that have not been published yet and that can be found in detail only in
arXiv preprints.


E. Gorbunov - A. Rogozin 
Moscow Institute of Physics and Technology, Moscow, Russia 


Russian Presidential Academy of National Economy and Public Administration, Moscow, Russia 
e-mail: eduard.gorbunov@phystech.edu; aleksandr.rogozin@phystech.edu


A. Beznosikov 
Moscow Institute of Physics and Technology, Moscow, Russia 
e-mail: beznosikov.an@phystech.edu


D. Dvinskikh 
Moscow Institute of Physics and Technology, Moscow, Russia 
High School Economic University, https://www.hse.ru/en/ 


Institute for Information Transmission Problems RAS, Moscow, Russia 
e-mail: darina.dvinskikh@phystech.edu


A. Gasnikov (✉)
Moscow Institute of Physics and Technology, Moscow, Russia 


Institute for Information Transmission Problems RAS, Moscow, Russia 


Caucasus Mathematical Center, Adyghe State University, Maykop, Russia 
e-mail: gasnikov.av@phystech.edu


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Nikeghbali et al. (eds.), High-Dimensional Optimization and Probability,
Springer Optimization and Its Applications 191,
https://doi.org/10.1007/978-3-031-00832-0_8


1 Introduction 


In this work, we focus on the following convex optimization problem

$$\min_{x \in Q \subseteq \mathbb{R}^n} f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x), \tag{1}$$

where the functions $\{f_i\}_{i=1}^m$ are convex and $Q$ is a convex set. Problems of this
kind arise in many machine learning applications [139] (e.g., empirical risk
minimization) and statistical applications [146] (e.g., maximum likelihood estimation).
To solve these problems, decentralized distributed methods are widely used
(see [33, 111] and the references therein). This direction gained popularity with
the release of the book [13]. Many researchers (among whom we especially note
Angelia Nedich) have productively promoted distributed algorithms over the last 30
years. Due to the emergence of big data and the rapid growth of problem sizes,
decentralized distributed methods have attracted increased interest in the last decade.
In this paper, we mainly focus on the last five years of theoretical advances, starting
with the remarkable paper [8]. The authors of [8] introduce lower complexity
bounds on the number of communication rounds required to obtain an $\varepsilon$-accurate
solution $x^N$ of (1) in function value, i.e., $f(x^N) - \min_{x \in Q} f(x) \le \varepsilon$.

Let us formulate the result of [8] (see also [88, 135, 137, 143, 160]) formally.
Assume that we have some connected undirected graph (network) with $m$ nodes.
To each node $i$ of this graph, we privately assign the function $f_i$ and suppose that
node $i$ can calculate $\nabla f_i$ at some point $x$. At each communication round the
nodes can communicate with their neighbors, i.e., send and receive a message with
no more than $O(n)$ numbers. In the $O(R)$-neighborhood of a solution $x^*$ of (1)
(where $R = \|x^0 - x^*\|_2$ is the Euclidean distance between the starting point $x^0$ and
the solution $x^*$ that minimizes this norm), we suppose that the
functions $f_i$ are $M$-Lipschitz continuous (i.e., $\|\nabla f_i(x)\|_2 \le M$) and $L$-Lipschitz
smooth (i.e., $\|\nabla f_i(y) - \nabla f_i(x)\|_2 \le L\|y - x\|_2$). The optimal bounds on the number
of communications and the number of oracle calls per node are summarized in
Table 1. Here and below $\widetilde{O}(\cdot)$ means the same as $O(\cdot)$ up to a $\log(1/\varepsilon)$ factor, and
$O(\sqrt{\chi})$ corresponds to the consensus time, that is, the number of communication
rounds required to reach consensus in the considered network (a more accurate
definition of $O(\sqrt{\chi})$ is given in Sects. 2 and 3).

In the last few years, algorithms have been developed that reach the lower bounds
from Table 1. In Sect. 2, we consider one such algorithm [132, 134] for the
case when the functions $f_i$ are smooth. This algorithm has the simplest structure
among all known alternatives: it is a direct consensus-projection generalization of
Nesterov's fast gradient method.

When communication networks vary from time to time (time-varying communication
networks, see Sect. 2), we replace $\sqrt{\chi}$ by $\chi$ and we suppose that different
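$f_i$ may have different smoothness constants $L_i$.

Table 1 Optimal bounds for communication rounds and deterministic oracle calls of $\nabla f_i$ per node

| | $f_i$ is $\mu$-strongly convex and $L$-smooth | $f_i$ is $L$-smooth | $f_i$ is $\mu$-strongly convex | |
| # communication rounds | $\widetilde{O}\left(\sqrt{\frac{L}{\mu}\chi}\right)$ | $O\left(\sqrt{\frac{LR^2}{\varepsilon}\chi}\right)$ | $O\left(\sqrt{\frac{M^2}{\mu\varepsilon}\chi}\right)$ | $O\left(\sqrt{\frac{M^2R^2}{\varepsilon^2}\chi}\right)$ |
| # oracle calls of $\nabla f_i$ per node $i$ | $\widetilde{O}\left(\sqrt{\frac{L}{\mu}}\right)$ | $O\left(\sqrt{\frac{LR^2}{\varepsilon}}\right)$ | $O\left(\frac{M^2}{\mu\varepsilon}\right)$ | $O\left(\frac{M^2R^2}{\varepsilon^2}\right)$ |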


The non-smooth case (when the functions $f_i$ are Lipschitz continuous) is studied in
Sect. 3, where the results from [32, 51] are summarized. The approach is based on
reformulating the decentralized distributed problem as a non-distributed convex
optimization problem with affine constraints, which are then moved into the
objective function as a composite quadratic penalty. To solve this problem, Lan's sliding
algorithm [87] can be used.

The same construction and estimates hold (see Table 2) in the non-smooth
stochastic case, when instead of the subgradients $\nabla f_i(x)$ we have access only to
their unbiased estimates $\nabla f_i(x, \xi_i)$. We assume here that $\mathbb{E}\|\nabla f_i(x, \xi_i)\|_2^2 \le M^2$ in
an $O(R)$-neighborhood of $x^*$.

The smooth part of Table 2 describes the known lower bounds. There exist
methods that are optimal in only one of the two mentioned criteria [32]: either in
communication rounds or in oracle calls per node. The technique from [134] (also
described in Sect. 2), combined with a proper batch-size policy [36], allows one to reach
these lower bounds up to a logarithmic factor [131].

Section 3 also contains analogues of the results mentioned in Tables 1 and 2 for the
dual (stochastic) gradient-type oracle. That is, instead of access to $\nabla f_i$ at each node,
we have access to the gradient of the conjugate function $\nabla f_i^*$ [135, 160].
Such an oracle appears in different applications, in particular, in the Wasserstein barycenter
problem [31, 34, 37, 80, 158].

In Sect. 4, we transfer the results mentioned above to the gradient-free oracle, assuming
that we have access only to $f_i$ instead of $\nabla f_i$. In this case, a trivial solution
comes to mind: to restore the gradient from finite differences. Based on optimal
gradient-type methods in the smooth case, it is possible to build optimal gradient-free
methods. But what about the non-smooth case? To the best of our knowledge,
this was an open question until recently. Based on [15], we provide an answer to
this question (Sect. 4). More precisely, we transfer optimal gradient-free
algorithms for non-smooth (stochastic two-point) convex optimization problems
[12, 141] from the non-distributed setup to the decentralized distributed one. Here, as in
Sect. 3, we mainly use the penalty trick and Lan's sliding.

It is worth adding several results to the list of recent advances collected in Tables 1
and 2. The first result describes the case when $f_i(x) = \frac{1}{r}\sum_{j=1}^{r} f_i^j(x)$ in (1) and
$Q = \mathbb{R}^n$. All $f_i^j$ are $L$-smooth and $\mu$-strongly convex. In this case, the lower
bounds were obtained in [59]. Optimal algorithms were proposed in [59, 97]. These
algorithms require $\widetilde{O}\left(\sqrt{\frac{L}{\mu}\chi}\right)$ communication rounds and $\widetilde{O}\left(r + \sqrt{r\frac{L}{\mu}}\right)$ oracle
calls ($\nabla f_i^j$ calculations) per node. This is a valuable result since in real machine
learning applications the sum-type representation of $f_i$ is typical.

Another way to use this representation is the statistical similarity of the $f_i$. The lower
bound on communication rounds in the deterministic smooth case was also obtained in
[8]. Roughly speaking, if the Hessians of the $f_i$ are $\beta$-close in the 2-norm, then
the lower bound on communication rounds is $\widetilde{\Omega}\left(\sqrt{\frac{\beta}{\mu}\chi}\right)$. Here $\widetilde{\Omega}(\cdot)$ is the
notation for lower bounds on the growth rate hiding logarithms. For example, if
fi(x) = 1 ae 1 ca (x) and all a are j1-strongly convex we have B ~ w+ 
In the decentralized distributed setup there is a gap between this lower bound and 


const | 


the optimal (non-accelerated) bound O f x ) that can be achieved at the moment 


[153]. But for centralized distributed architectures with additional assumptions on
the $f_i$, partial acceleration is possible [60]. In the recent paper [155] the gap for
decentralized optimization was closed by using distributed Catalyst.

Another group of results relates to very specific (but rather popular) centralized
federated learning architectures [68]. According to the estimates mentioned above,
heterogeneous federated learning can be considered as a special case (with $\chi = 1$)
[53, 70, 75, 165]. In the paper [75], this was explained based on the analysis of
unified decentralized SGD. Paper [75] also brings many different distributed
setups under one general approach. We partially try to use the generality from [75] in
Sect. 2. To the best of our knowledge, accelerating all the
results of [75] is an open question. Section 2 contains such an acceleration only in the deterministic case.

Along with minimization, distributed saddle-point problems are an interesting
avenue of research [104, 107, 130, 163]. The basis for the distributed solution
of min-max problems is formed by the extragradient and Mirror-Prox methods [114]. Unlike
minimization, where the optimal dependence of iteration complexity on the function
condition number is $\sqrt{\kappa}$, the lower complexity bound for saddle-point algorithms
includes $\kappa$, and Nesterov acceleration does not improve classical non-accelerated
methods. In the decentralized case, the lower bound on the number of communications
is $O(\kappa\sqrt{\chi}\log(1/\varepsilon))$ [130]. Therefore, the Nesterov acceleration technique is not
needed to obtain optimal methods for min-max problems in either classical
or distributed optimization. But in particular cases (i.e., different constants of
strong convexity and strong concavity) acceleration is possible due to distributed
Catalyst [155] and [45, 100, 156, 169]. Note also that a lower bound and an optimal
decentralized algorithm for saddle-point problems with variance reduction were
proposed in [18] (this paper develops the results of [60, 98]). A lower bound and an optimal
decentralized algorithm for saddle-point problems with similarity were proposed in
[20].


2 Decentralized Optimization of Smooth Convex Functions


Consider problem (1) and rewrite it in the following form:

$$\min_{X \in \mathbb{R}^{m \times n}} F(X) = \sum_{i=1}^{m} f_i(x_i) \quad \text{s.t.} \quad x_1 = \ldots = x_m, \tag{2}$$

where $X = (x_1 \ldots x_m)^\top$. The decentralized optimization problem is now reformulated
as an optimization problem with linear constraints. The constraint set is $C =
\{x_1 = \ldots = x_m\}$.

The functions $f_i$ are stored on the nodes of the network, which is represented
by an undirected graph $\mathcal{G} = (V, E)$. Every node has access to its own function
and its first-order characteristics. Decentralized first-order methods use two types of
steps: computational steps, i.e., performing local computations, and communication
steps, i.e., exchanging information with neighbors. Alternating these two
types of steps results in minimizing the objective while keeping the agents' vectors
approximately equal.

We begin with an overview of how communication procedures are developed and
analyzed. The iterative information exchange is referred to as consensus or gossip
algorithms in the literature [22, 110, 157, 167].


2.1 Consensus Algorithms 


Let each agent $i$ in the network initially hold a vector $x_i^0$ and let the communication
network be represented by a connected graph $\mathcal{G} = (V, E)$. The agents seek to find
the average vector across the network, but their communication is restricted to sending
and receiving information from their direct neighbors. In one communication
round, every two nodes linked by an edge exchange their vector values. After that,
agent $i$ sums the received values with predefined coefficients $m_{ij}$, where $j$ is the
index of the corresponding neighbor. In other words, every node runs the update

$$x_i^{k+1} = [M]_{ii}\, x_i^k + \sum_{(i,j) \in E} [M]_{ij}\, x_j^k,$$

where $[M]_{ij} = m_{ij}$ are the elements of the mixing matrix $M$. In matrix form, the update at one
communication step reads

$$X^{k+1} = M X^k. \tag{3}$$

Under additional assumptions this iterative scheme converges to the average of the
initial vectors over the network, i.e., to

$$\overline{X}^0 = \frac{1}{m}\mathbf{1}\mathbf{1}^\top X^0 = \mathbf{1}\Big(\frac{1}{m}\sum_{i=1}^{m} x_i^0\Big)^\top.$$


Assumption 1 The mixing matrix $M$ satisfies the following properties.


* (Decentralized property) If $(i, j) \notin E$ and $i \ne j$, then $[M]_{ij} = 0$. Otherwise
$[M]_{ij} > 0$.

* (Symmetry and double stochasticity) $M\mathbf{1} = \mathbf{1}$ and $M = M^\top$.

* (Spectrum property) Denote by $\lambda_2(M)$ the absolute value of the second largest (in
absolute value) eigenvalue of $M$. Then $\lambda_2(M) < 1$.

The choice of weights for the mixing matrix is an interesting problem that we do not
address here (see [22] for details). A mixing matrix with Metropolis weights satisfies
Assumption 1:

$$[M]_{ij} = \begin{cases} 1/(1 + \max\{d_i, d_j\}) & \text{if } (i, j) \in E, \\ 0 & \text{if } (i, j) \notin E \text{ and } i \ne j, \\ 1 - \sum_{m:\,(i,m) \in E} [M]_{im} & \text{if } i = j, \end{cases}$$

where $d_i$ denotes the degree of node $i$.

Several variations of Assumption 1 can be found in the literature. In particular, in
[102] the mixing matrix need not be symmetric. Instead, it is assumed to
be doubly stochastic and to have a real spectrum. Moreover, the spectrum property
in Assumption 1 implies that $\mathbf{1}$ is the only (up to a scaling factor) eigenvector
corresponding to the eigenvalue 1, i.e., $\ker(I - M) = \operatorname{span}(\mathbf{1})$.
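The Metropolis rule can be sketched in a few lines of plain Python. This is only an illustration: the function name `metropolis_weights` and the four-node path graph are assumptions made for the example, not part of the original text.

```python
def metropolis_weights(m, edges):
    """Build the Metropolis mixing matrix for an undirected graph on nodes 0..m-1."""
    deg = [0] * m
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    M = [[0.0] * m for _ in range(m)]
    # Off-diagonal entries: 1 / (1 + max(d_i, d_j)) on edges, zero elsewhere.
    for i, j in edges:
        w = 1.0 / (1 + max(deg[i], deg[j]))
        M[i][j] = w
        M[j][i] = w
    # Diagonal entries make every row sum to one.
    for i in range(m):
        M[i][i] = 1.0 - sum(M[i][j] for j in range(m) if j != i)
    return M

# Hypothetical example: the path graph 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
M = metropolis_weights(4, edges)
```

By construction the matrix is symmetric, its rows sum to one, and entries between non-adjacent nodes are zero, which are the first two properties of Assumption 1.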


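Lemma 1 For the iterative consensus procedure (3) it holds that

$$\|X^k - \overline{X}^0\|_F \le (\lambda_2(M))^k\, \|X^0 - \overline{X}^0\|_F.$$

Proof Let $x \in \mathbb{R}^m$ and $\bar{x} = \frac{1}{m}\mathbf{1}\mathbf{1}^\top x$. First, note that $M\bar{x} = M \cdot \frac{1}{m}\mathbf{1}\mathbf{1}^\top x = \frac{1}{m}\mathbf{1}\mathbf{1}^\top x =
\bar{x}$. It can be easily seen that $x - \bar{x} \in (\operatorname{span}(\mathbf{1}))^\perp$ and $Mx - \bar{x} \in (\operatorname{span}(\mathbf{1}))^\perp$:

$$\langle x - \bar{x}, \mathbf{1}\rangle = \Big\langle \Big(I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top\Big)x, \mathbf{1}\Big\rangle = \Big\langle x, \Big(I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top\Big)\mathbf{1}\Big\rangle = 0,$$
$$\langle Mx - \bar{x}, \mathbf{1}\rangle = \Big\langle \Big(M - \frac{1}{m}\mathbf{1}\mathbf{1}^\top\Big)x, \mathbf{1}\Big\rangle = \Big\langle x, \Big(M - \frac{1}{m}\mathbf{1}\mathbf{1}^\top\Big)\mathbf{1}\Big\rangle = 0.$$

On the subspace $(\operatorname{span}(\mathbf{1}))^\perp$ the largest (in absolute value) eigenvalue of $M$ has modulus $\lambda_2(M)$. We have

$$\|Mx - \bar{x}\|_2 = \|M(x - \bar{x})\|_2 \le \lambda_2(M)\,\|x - \bar{x}\|_2.$$

Applying the derived fact to every column of $X^k$ we get $\overline{X}^k = \overline{X}^0$ and
$\|X^{k+1} - \overline{X}^0\|_F \le \lambda_2(M)\,\|X^k - \overline{X}^0\|_F$ for every $k \ge 0$, which concludes the
proof.

By Lemma 1, consensus scheme (3) requires $O\left(\frac{1}{1 - \lambda_2(M)}\log\frac{1}{\varepsilon}\right)$ iterations to
achieve accuracy $\varepsilon$, i.e., to find the arithmetic mean of the vectors over the network with
precision $\varepsilon$: $\|X^k - \overline{X}^0\|_F \le \varepsilon$.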
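The contraction in Lemma 1 can be checked numerically on a small example. The sketch below is illustrative only and rests on assumptions not in the original: each agent holds a single scalar, the graph is a six-node cycle with Metropolis weights (so every nonzero weight is 1/3), and $\lambda_2(M)$ is taken from the closed-form spectrum of this circulant matrix.

```python
import math

def cycle_mixing_matrix(m):
    """Metropolis weights on a cycle: every degree is 2, so each nonzero weight is 1/3."""
    M = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in (i - 1, i, i + 1):
            M[i][j % m] = 1.0 / 3.0
    return M

def consensus_step(M, x):
    """One communication round x <- M x (one scalar per agent)."""
    n = len(x)
    return [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]

def dist_to_mean(x):
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x))

m = 6
M = cycle_mixing_matrix(m)
# For this circulant M the eigenvalues are 1 - (2 - 2 cos(2 pi j / m)) / 3, j = 0..m-1;
# lambda_2 is the largest modulus over j != 0.
lam2 = max(abs(1 - (2 - 2 * math.cos(2 * math.pi * j / m)) / 3) for j in range(1, m))

x = [float(i * i) for i in range(m)]  # hypothetical initial values, one per agent
errors = [dist_to_mean(x)]
for _ in range(20):
    x = consensus_step(M, x)
    errors.append(dist_to_mean(x))
```

Each entry of `errors` should stay below $(\lambda_2(M))^k$ times the initial error, while the network average is preserved at every round.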


Remark If the graph changes with time, we associate with it a sequence of mixing matrices
$\{M^k\}_{k=0}^{\infty}$. In the time-varying case, the convergence rate of the consensus algorithm
is governed by the worst-case second largest eigenvalue, i.e., $\max_{k \ge 0} \lambda_2(M^k)$. Provided that
each $M^k$ is symmetric, doubly stochastic and satisfies the decentralized property in
Assumption 1, the number of communication rounds needed to reach consensus accuracy $\varepsilon$
is $O\left(\frac{1}{1 - \max_{k \ge 0}\lambda_2(M^k)}\log\frac{1}{\varepsilon}\right)$.


2.1.1 Quadratic Optimization Point of View 


For a given undirected graph $\mathcal{G} = (V, E)$ introduce its Laplacian matrix

$$[W]_{ij} = \begin{cases} -1, & \text{if } (i, j) \in E, \\ \deg(i), & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

The Laplacian matrix is positive semi-definite, and for $X = (x_1 \ldots x_m)^\top$ it holds that $WX =
0 \Leftrightarrow x_1 = \ldots = x_m$. A more detailed discussion of the Laplacian matrix and its
applications is provided in Sect. 3.3. The consensus problem can be reformulated as

$$\min_{X \in \mathbb{R}^{m \times n}} g(X) = \frac{1}{2}\langle X, WX\rangle. \tag{4}$$

Any matrix $X^*$ with equal rows is a solution of problem (4), and, therefore, the set
of minimizers of problem (4) is a linear subspace of the form $\mathcal{X}^* = \{\mathbf{1}x^\top : x \in \mathbb{R}^n\}$.
Denote by $\lambda_{\max}(W)$ and $\lambda_{\min}^+(W)$ the largest and the smallest non-zero eigenvalues of
$W$, respectively. Then $g(X)$ has Lipschitz gradients with constant $\lambda_{\max}(W)$ and is
strongly convex on $(\ker W)^\perp$ with modulus $\lambda_{\min}^+(W)$. Let non-accelerated gradient
descent be run on the function $g(X)$:

$$X^{k+1} = X^k - \frac{1}{\lambda_{\max}(W)} W X^k. \tag{5}$$


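First, note that the trajectory of method (5) stays in $X^0 + (\ker W)^\perp$. To verify this,
consider $Z \in \ker W$:

$$\langle WX^k, Z\rangle = \langle X^k, WZ\rangle = 0 \;\Rightarrow\; WX^k \in (\ker W)^\perp,$$
$$X^{k+1} - X^0 = -\frac{1}{\lambda_{\max}(W)}\sum_{t=0}^{k} W X^t \in (\ker W)^\perp.$$

Method (5) converges to some point in $\mathcal{X}^*$. Since its trajectory lies in $X^0 +
(\ker W)^\perp$, the limit point of $\{X^k\}_{k=0}^{\infty}$ is the projection of $X^0$ onto $\ker W$, i.e.,
$\overline{X}^0$. The algorithm requires $O\left(\frac{\lambda_{\max}(W)}{\lambda_{\min}^+(W)}\log\frac{1}{\varepsilon}\right)$ iterations to reach accuracy $\varepsilon$.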

In order to establish the connection between gradient descent for problem (4)
and consensus algorithm (3), introduce $M = I - \frac{W}{\lambda_{\max}(W)}$. The matrix $M$ satisfies
Assumption 1, and update rule (5) can be rewritten as

$$X^{k+1} = X^k - \frac{W}{\lambda_{\max}(W)} X^k = \Big(I - \frac{W}{\lambda_{\max}(W)}\Big) X^k = M X^k.$$

Therefore, gradient descent on $g(X)$ with constant step-size $\frac{1}{\lambda_{\max}(W)}$ is equivalent
to non-accelerated consensus algorithm (3). Moreover, the iteration complexities
coincide, since $\lambda_2(M) = 1 - \frac{\lambda_{\min}^+(W)}{\lambda_{\max}(W)}$ and, therefore,
$O\left(\frac{1}{1 - \lambda_2(M)}\log\frac{1}{\varepsilon}\right) = O\left(\frac{\lambda_{\max}(W)}{\lambda_{\min}^+(W)}\log\frac{1}{\varepsilon}\right)$.

We note that the same gradient descent analogy holds for time-varying networks.


Given a sequence of connected undirected graphs $\{\mathcal{G}^k\}_{k=0}^{\infty}$, consider the sequence of
corresponding Laplacians $\{W^k\}_{k=0}^{\infty}$. The consensus algorithm may be interpreted as
gradient descent on a time-varying quadratic function $\{g^k(X)\}_{k=0}^{\infty}$, where $g^k(X)$
is defined as

$$g^k(X) = \frac{1}{2}\langle X, W^k X\rangle. \tag{6}$$

All $g^k(X)$ have a common set of minimizers $\{\mathbf{1}x^\top : x \in \mathbb{R}^n\}$. The worst-case Lipschitz
constant over time is $\max_{k \ge 0}\lambda_{\max}(W^k)$, and the consensus iteration is written similarly
to (5) with $\lambda_{\max}(W)$ replaced by $\max_{k \ge 0}\lambda_{\max}(W^k)$. The convergence guarantees of
gradient descent over a time-varying function $\{g^k(X)\}_{k=0}^{\infty}$ do not break since the
Lyapunov function for non-accelerated gradient dynamics is the squared distance to
the solution set, i.e., it does not depend on the minimization objective. Therefore,
non-accelerated gradient descent decreases the Lyapunov function at each step and
is robust to changes in the objective function. The number of communication rounds
needed to reach consensus accuracy $\varepsilon$ is $O\left(\frac{\max_{k \ge 0}\lambda_{\max}(W^k)}{\min_{k \ge 0}\lambda_{\min}^+(W^k)}\log\frac{1}{\varepsilon}\right)$.
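In order to obtain a better dependence on $\lambda_{\max}(W)/\lambda_{\min}^+(W)$, Nesterov acceleration [116]
may be employed. Consider the Nesterov accelerated method for strongly convex
objectives

$$\beta = \frac{\sqrt{\lambda_{\max}(W)} - \sqrt{\lambda_{\min}^+(W)}}{\sqrt{\lambda_{\max}(W)} + \sqrt{\lambda_{\min}^+(W)}}, \tag{7a}$$
$$Y^k = X^k + \beta(X^k - X^{k-1}), \tag{7b}$$
$$X^{k+1} = Y^k - \frac{1}{\lambda_{\max}(W)} W Y^k. \tag{7c}$$

Analogously to the non-accelerated scheme (5), the trajectory of the accelerated Nesterov
method lies in $X^0 + (\ker W)^\perp$. This can be easily seen by induction:

$$Y^k - X^0 = (X^k - X^0) + \beta\big[(X^k - X^0) - (X^{k-1} - X^0)\big] \in (\ker W)^\perp,$$
$$X^{k+1} - X^0 = \underbrace{(Y^k - X^0)}_{\in (\ker W)^\perp} - \frac{1}{\lambda_{\max}(W)}\underbrace{W Y^k}_{\in \operatorname{Im} W = (\ker W)^\perp} \in (\ker W)^\perp.$$

Therefore, the accelerated scheme converges to the projection of $X^0$ onto $\ker W$, which
is $\overline{X}^0$, i.e., the matrix whose rows all equal the arithmetic average of the rows of $X^0$.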
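To make the gain from acceleration concrete, the following sketch runs the plain gradient scheme (5) and the accelerated scheme (7a)-(7c) on the Laplacian of a 20-node cycle. It is an illustration under assumptions not in the original: each agent holds one scalar, the accelerated method is started with $X^{-1} = X^0$, and the extreme eigenvalues are taken from the closed-form cycle spectrum $2 - 2\cos(2\pi j/m)$.

```python
import math

def cycle_laplacian(m):
    W = [[0.0] * m for _ in range(m)]
    for i in range(m):
        W[i][i] = 2.0
        W[i][(i - 1) % m] = -1.0
        W[i][(i + 1) % m] = -1.0
    return W

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dist_to_mean(x):
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x))

m = 20
W = cycle_laplacian(m)
eigs = [2 - 2 * math.cos(2 * math.pi * j / m) for j in range(1, m)]  # non-zero eigenvalues
lam_max, lam_min_plus = max(eigs), min(eigs)
x0 = [float(i) for i in range(m)]  # hypothetical initial values
iters = 200

# Non-accelerated scheme (5).
x_gd = x0[:]
for _ in range(iters):
    Wx = matvec(W, x_gd)
    x_gd = [xi - wi / lam_max for xi, wi in zip(x_gd, Wx)]
err_gd = dist_to_mean(x_gd)

# Accelerated scheme (7a)-(7c), started with X^{-1} = X^0.
beta = (math.sqrt(lam_max) - math.sqrt(lam_min_plus)) / (math.sqrt(lam_max) + math.sqrt(lam_min_plus))
x_prev, x_acc = x0[:], x0[:]
for _ in range(iters):
    y = [xi + beta * (xi - xp) for xi, xp in zip(x_acc, x_prev)]
    Wy = matvec(W, y)
    x_prev, x_acc = x_acc, [yi - wi / lam_max for yi, wi in zip(y, Wy)]
err_acc = dist_to_mean(x_acc)
```

Both schemes preserve the network average (here 9.5), and after the same number of communication rounds the accelerated iterate should be much closer to consensus.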

Note that acceleration is not attainable over time-varying graphs. Imagine we run the
Nesterov gradient method over a time-varying objective $\{g^k(X)\}_{k=0}^{\infty}$ defined in (6).
The potential function for accelerated gradient dynamics [10] includes the objective
function. Since the objective function changes, the potential function is also
time-dependent, and this fact breaks the proof of the convergence result. A formal proof of
why acceleration is impossible over time-varying networks that stay connected and
undirected at each iteration is provided in [76].


2.1.2 Chebyshev Acceleration


As shown in Sect. 2.1.1, the convergence of the consensus algorithm depends on the
condition number $\chi = \frac{\lambda_{\max}(W)}{\lambda_{\min}^+(W)}$. The factor $\chi$ also appears in upper complexity
bounds for decentralized algorithms [135] and measures the connectivity of the
communication graph. There exists a technique called Chebyshev
acceleration that improves the dependence on $\chi$ by replacing the communication
matrix $W$ with a Chebyshev polynomial $P_K(W)$. The structure of the polynomial
ensures that the condition number of $P_K(W)$ is $O(1)$ when its degree is $K =
\lfloor\sqrt{\chi}\rfloor$. In other words, multiplication by $P_K(W)$ requires $\lfloor\sqrt{\chi}\rfloor$ communication
rounds, but the condition number is reduced from $\chi$ to $O(1)$. The product $P_K(W)X$
is computed via an iterative consensus-based process. Introduce $c_2 = \frac{\chi + 1}{\chi - 1}$, $a_0 =
1$, $a_1 = c_2$, $c_3 = \frac{2}{\lambda_{\max}(W) + \lambda_{\min}^+(W)}$, $X^0 = X$, $X^1 = c_2(I - c_3 W)X$, for
$t = 1, \ldots, K - 1$ do

$$a_{t+1} = 2c_2 a_t - a_{t-1}, \qquad X^{t+1} = 2c_2(I - c_3 W)X^t - X^{t-1},$$

and return $X^0 - X^K / a_K$.

Summing up, a consensus algorithm of the form $X^{k+1} = (I - P_K(W)/\lambda_{\max}(P_K(W)))X^k$
requires a total of $O(\sqrt{\chi}\log(1/\varepsilon))$ communications to achieve accuracy
$\varepsilon$. This complexity is better than the $O(\chi\log(1/\varepsilon))$ that corresponds to the standard
consensus algorithm $X^{k+1} = (I - W/\lambda_{\max}(W))X^k$. Chebyshev acceleration was
used to obtain the first optimal decentralized methods in [135].
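The three-term recursion can be sketched as follows. This is an illustration under assumptions not taken from the original: one scalar per agent, a 20-node cycle with its closed-form spectrum, and $c_2 = (\chi + 1)/(\chi - 1)$, which is our reading of the garbled constant in the text. The helper name `chebyshev_apply` is hypothetical.

```python
import math

def cycle_laplacian(m):
    W = [[0.0] * m for _ in range(m)]
    for i in range(m):
        W[i][i] = 2.0
        W[i][(i - 1) % m] = -1.0
        W[i][(i + 1) % m] = -1.0
    return W

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dist_to_mean(x):
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x))

def chebyshev_apply(W, x, K, c2, c3):
    """Return (P_K(W) x, a_K) using K multiplications by W (K communication rounds)."""
    def shifted(v):  # computes (I - c3 W) v
        Wv = matvec(W, v)
        return [vi - c3 * wi for vi, wi in zip(v, Wv)]
    a_prev, a_cur = 1.0, c2
    x_prev, x_cur = x[:], [c2 * v for v in shifted(x)]
    for _ in range(1, K):
        a_prev, a_cur = a_cur, 2 * c2 * a_cur - a_prev
        s = shifted(x_cur)
        x_prev, x_cur = x_cur, [2 * c2 * si - xp for si, xp in zip(s, x_prev)]
    return [xi - xk / a_cur for xi, xk in zip(x, x_cur)], a_cur

m = 20
W = cycle_laplacian(m)
eigs = [2 - 2 * math.cos(2 * math.pi * j / m) for j in range(1, m)]
lam_max, lam_min_plus = max(eigs), min(eigs)
chi = lam_max / lam_min_plus
K = math.floor(math.sqrt(chi))
c2 = (chi + 1) / (chi - 1)
c3 = 2 / (lam_max + lam_min_plus)

x = [float(i) for i in range(m)]  # hypothetical initial values
Px, a_K = chebyshev_apply(W, x, K, c2, c3)
gossip = [xi - pi for xi, pi in zip(x, Px)]  # equals X^K / a_K
ones_image, _ = chebyshev_apply(W, [1.0] * m, K, c2, c3)
```

On this example the non-zero eigenvalues of $P_K(W)$ lie in $[1 - 1/a_K, 1 + 1/a_K]$, so its condition number is at most $(1 + 1/a_K)/(1 - 1/a_K)$, far below $\chi$, and consensual vectors are mapped to zero.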


2.1.3 Summary 


In this section, we covered consensus algorithms over time-static and time-varying
undirected graphs.

First, consider time-static networks. Different types of matrices may
correspond to the graph: the mixing matrix $M$ or the Laplacian matrix $W$. The key
difference is that the mixing matrix is doubly stochastic, i.e., $M\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^\top M = \mathbf{1}^\top$,
while the Laplacian has the null-space property $W\mathbf{1} = 0$. A consensus step based on the
mixing matrix is just multiplication by $M$; therefore, it does not require additional
knowledge of the mixing matrix spectrum. Concerning the Laplacian $W$, we can either
build a mixing matrix $I - W/\lambda_{\max}(W)$ or use the quadratic minimization approach
(see Sect. 2.1.1). In both cases, knowledge of $\lambda_{\max}(W)$ is required.

Second, acceleration techniques are also applicable to consensus iteration
schemes, but only in the time-static case. We can use Chebyshev acceleration
covered in Sect. 2.1.2 or apply an accelerated Nesterov method to a quadratic
problem (Sect. 2.1.1). In both cases, we improve the iteration complexity from
$O(\chi\log(1/\varepsilon))$ to $O(\sqrt{\chi}\log(1/\varepsilon))$. However, acceleration is not attainable in
the time-varying case. When the graph changes, we can only run non-accelerated
consensus.


2.2 Main Assumptions on Objective Functions 


In this section, we introduce basic assumptions on the functions locally held by 
computational entities in the network. 


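Assumption 2 For every $i = 1, \ldots, m$, the function $f_i$ is differentiable, convex, and
$L_i$-smooth ($L_i > 0$).


Assumption 3 For every $i = 1, \ldots, m$, the function $f_i$ is $\mu_i$-strongly convex ($\mu_i > 0$).

Under Assumptions 2 and 3, for any $x_i, y_i \in \mathbb{R}^n$, $i = 1, \ldots, m$, it holds that

$$\frac{\mu_i}{2}\|y_i - x_i\|_2^2 \le f_i(y_i) - f_i(x_i) - \langle\nabla f_i(x_i), y_i - x_i\rangle \le \frac{L_i}{2}\|y_i - x_i\|_2^2.$$

Summing the above inequality over $i = 1, \ldots, m$ we obtain

$$\frac{\min_i \mu_i}{2}\|Y - X\|_F^2 \le \sum_{i=1}^{m}\frac{\mu_i}{2}\|y_i - x_i\|_2^2 \le F(Y) - F(X) - \langle\nabla F(X), Y - X\rangle \le \sum_{i=1}^{m}\frac{L_i}{2}\|y_i - x_i\|_2^2 \le \frac{\max_i L_i}{2}\|Y - X\|_F^2.$$

On the other hand, given that $X, Y \in C$, i.e., $x_1 = \ldots = x_m$, $y_1 = \ldots = y_m$, we
have

$$\frac{1}{2m}\sum_{i=1}^{m}\mu_i\,\|Y - X\|_F^2 \le F(Y) - F(X) - \langle\nabla F(X), Y - X\rangle \le \frac{1}{2m}\sum_{i=1}^{m}L_i\,\|Y - X\|_F^2.$$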

Therefore, $F(X)$ has different strong convexity and smoothness constants on $\mathbb{R}^{m \times n}$
and on $C$. Following the definitions in [135], we introduce


* (local constants) $F(X)$ is $\mu_l$-strongly convex and $L_l$-smooth on $\mathbb{R}^{m \times n}$, where
$\mu_l = \min_i \mu_i$, $L_l = \max_i L_i$.
* (global constants) $F(X)$ is $\mu_g$-strongly convex and $L_g$-smooth on $C$, where
$\mu_g = \frac{1}{m}\sum_{i=1}^{m}\mu_i$, $L_g = \frac{1}{m}\sum_{i=1}^{m}L_i$.

Note that the local smoothness and strong convexity constants may be significantly worse than
the global ones, i.e., $L_l \gg L_g$, $\mu_l \ll \mu_g$ (see [135] for details). We denote by

$$\kappa_l = \frac{L_l}{\mu_l}, \qquad \kappa_g = \frac{L_g}{\mu_g} \tag{8}$$

the local and global condition numbers, respectively. A trick proposed in [135]
allows one to improve the local condition number by slightly changing the functions $f_i$.
Namely, introduce $\tilde{f}_i(x) = f_i(x) - \frac{\mu_i - \mu_g}{2}\|x\|_2^2$ instead of $f_i$. Then the local
condition number becomes

$$\tilde{\kappa}_l = \frac{\max_i (L_i - \mu_i)}{\mu_g} + 1.$$

2.3 Distributed Gradient Descent 


Distributed gradient methods alternate between optimization updates and information
exchange steps. One (synchronized) communication round can be represented as
multiplication by a mixing matrix compatible with the graph topology. One of the
first distributed gradient dynamics studied in the literature [113, 173] uses a
time-static mixing matrix and reads

$$x_i^{k+1} = \sum_{j=1}^{m} [M]_{ij}\, x_j^k - \alpha\nabla f_i(x_i^k).$$

Using the notation $X = (x_1 \ldots x_m)^\top$, the above update rule takes the form

$$X^{k+1} = M X^k - \alpha\nabla F(X^k), \tag{9}$$

which is a combination of two step types: a gradient step with constant step-size $\alpha$
and a communication round with mixing matrix $M$. In [173] the authors showed that
the function residual $f(x^k) - f(x^*)$ in iterative scheme (9) decreases at an $O(1/k)$ rate
until reaching an $O(\alpha)$-neighborhood of the solution.

Method (9) does not find an exact solution in the general case. We follow the
arguments in [143] to illustrate this fact. First, note that $X$ is a solution of (2) if and
only if the two following conditions hold.


1. (Consensus) $X = MX$.
2. (Optimality) $\mathbf{1}^\top\nabla F(X) = 0$.

Let $X^\infty$ be a limit point of (9). Then

$$X^\infty = M X^\infty - \alpha\nabla F(X^\infty).$$

If $X^\infty$ were a solution of (2), the consensus condition would yield $X^\infty = M X^\infty$, i.e.,
$X^\infty$ has identical rows $[X^\infty]_i = x^\infty$. Therefore, $\nabla F(X^\infty) = 0$, which means
$\nabla f_1(x^\infty) = \ldots = \nabla f_m(x^\infty) = 0$.
Consequently, $x^\infty$ would be a common minimizer of every $f_i$, which is not a realistic case.
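The bias of scheme (9) is easy to observe on a toy problem. The sketch below is only an illustration under assumptions not in the original: scalar agents on a six-node cycle with Metropolis weights and local objectives $f_i(x) = \frac{1}{2}(x - b_i)^2$, so that $\nabla f_i(x) = x - b_i$ and the solution of (2) is the mean of the $b_i$.

```python
def cycle_mixing_matrix(m):
    """Metropolis weights on a cycle: each nonzero weight equals 1/3."""
    M = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in (i - 1, i, i + 1):
            M[i][j % m] = 1.0 / 3.0
    return M

def dgd(M, b, alpha, iters):
    """Run X^{k+1} = M X^k - alpha * grad F(X^k) for f_i(x) = 0.5 * (x - b_i)^2."""
    m = len(b)
    x = [0.0] * m
    for _ in range(iters):
        mixed = [sum(M[i][j] * x[j] for j in range(m)) for i in range(m)]
        x = [mixed[i] - alpha * (x[i] - b[i]) for i in range(m)]
    return x

m = 6
M = cycle_mixing_matrix(m)
b = [float(i) for i in range(m)]  # hypothetical local data; the minimizer of (2) is mean(b) = 2.5

x_big = dgd(M, b, alpha=0.1, iters=2000)
x_small = dgd(M, b, alpha=0.01, iters=20000)
spread_big = max(x_big) - min(x_big)      # disagreement between agents at the limit
spread_small = max(x_small) - min(x_small)
```

In this example the network average converges to the optimum, but the agents never agree exactly; the disagreement shrinks with the step-size, in line with the $O(\alpha)$-neighborhood statement.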

Method (9) is a basic distributed first-order method. Its variations
include feasible point algorithms [93] and subgradient methods [113] (in fact, the
latter work initially proposed scheme (9)). Extensions to stochastic objectives and
stochastic mixing matrices have been addressed in [2, 75, 101].


2.4 EXTRA 


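Distributed gradient descent (9) is unable to converge to the exact minimum of
(2), which is the major drawback of the method. An exact decentralized first-order
algorithm, EXTRA, was proposed in [143]. The approach of [143] is based on using
two different mixing matrices. Namely, consider two consecutive updates of type
(9):

$$X^{k+2} = M X^{k+1} - \alpha\nabla F(X^{k+1}), \tag{10}$$
$$X^{k+1} = \widetilde{M} X^k - \alpha\nabla F(X^k), \tag{11}$$

where $\widetilde{M}$ is a mixing matrix, i.e., $\widetilde{M} = (M + I)/2$ as proposed in [143]. Subtracting
(11) from (10) yields

$$X^{k+2} - X^{k+1} = M X^{k+1} - \widetilde{M} X^k - \alpha\big[\nabla F(X^{k+1}) - \nabla F(X^k)\big], \tag{12}$$

thus leading to the following algorithm.


Algorithm 14 EXTRA


Require: Step-size $\alpha > 0$.
$X^1 = M X^0 - \alpha\nabla F(X^0)$
for $k = 0, 1, \ldots$ do
  $X^{k+2} = (I + M) X^{k+1} - \widetilde{M} X^k - \alpha\big[\nabla F(X^{k+1}) - \nabla F(X^k)\big]$
end for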


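Let $X^\infty$ be a limit point of the iterate sequence $\{X^k\}_{k=0}^{\infty}$ generated by (10), (11).
Then

$$X^\infty - X^\infty = M X^\infty - \widetilde{M} X^\infty - \alpha\big[\nabla F(X^\infty) - \nabla F(X^\infty)\big],$$
$$(M - \widetilde{M}) X^\infty = \frac{1}{2}(M X^\infty - X^\infty) = 0.$$

The last equality means that $X^\infty$ is consensual, i.e., its rows are equal. On the other
hand, rearranging the terms in (12) and taking into account that $X^1 = M X^0 -
\alpha\nabla F(X^0)$ gives

$$X^{k+2} = \widetilde{M} X^{k+1} - \alpha\nabla F(X^{k+1}) + \sum_{t=0}^{k+1}(M - \widetilde{M}) X^t.$$

Multiplying by $\mathbf{1}^\top$ from the left yields

$$\mathbf{1}^\top X^{k+2} = \mathbf{1}^\top\widetilde{M} X^{k+1} - \alpha\mathbf{1}^\top\nabla F(X^{k+1}) + \sum_{t=0}^{k+1}\underbrace{\mathbf{1}^\top(M - \widetilde{M})}_{=0} X^t = \mathbf{1}^\top\widetilde{M} X^{k+1} - \alpha\mathbf{1}^\top\nabla F(X^{k+1}),$$

and taking the limit as $k \to \infty$ and using $\mathbf{1}^\top\widetilde{M} = \mathbf{1}^\top$ we obtain

$$\mathbf{1}^\top\nabla F(X^\infty) = 0,$$

which is the optimality condition for the point $X^\infty$. Therefore, a limit point of $\{X^k\}_{k=0}^{\infty}$
generated by Algorithm 14 is both consensual and optimal, i.e., is a solution of (2).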
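A minimal sketch of the EXTRA recursion on a toy problem shows the exactness claim numerically. The assumptions here are illustrative and not from the original: scalar agents on a six-node cycle with Metropolis weights, local objectives $f_i(x) = \frac{1}{2}(x - b_i)^2$, $\widetilde{M} = (M + I)/2$, and step-size $\alpha = 0.1$.

```python
def cycle_mixing_matrix(m):
    """Metropolis weights on a cycle: each nonzero weight equals 1/3."""
    M = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in (i - 1, i, i + 1):
            M[i][j % m] = 1.0 / 3.0
    return M

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def extra(M, b, alpha, iters):
    """EXTRA for f_i(x) = 0.5 * (x - b_i)^2, one scalar per agent."""
    m = len(b)
    # Second mixing matrix (M + I) / 2.
    Mt = [[0.5 * (M[i][j] + (1.0 if i == j else 0.0)) for j in range(m)] for i in range(m)]
    grad = lambda z: [z[i] - b[i] for i in range(m)]
    x_prev = [0.0] * m
    g_prev = grad(x_prev)
    Mx = matvec(M, x_prev)
    x = [Mx[i] - alpha * g_prev[i] for i in range(m)]  # X^1 = M X^0 - alpha grad F(X^0)
    for _ in range(iters):
        g = grad(x)
        Mx, Mtx_prev = matvec(M, x), matvec(Mt, x_prev)
        # X^{k+2} = (I + M) X^{k+1} - Mt X^k - alpha (grad F(X^{k+1}) - grad F(X^k))
        x_next = [x[i] + Mx[i] - Mtx_prev[i] - alpha * (g[i] - g_prev[i]) for i in range(m)]
        x_prev, g_prev, x = x, g, x_next
    return x

m = 6
M = cycle_mixing_matrix(m)
b = [float(i) for i in range(m)]  # hypothetical local data; the minimizer of (2) is mean(b) = 2.5
x_extra = extra(M, b, alpha=0.1, iters=500)
```

Unlike plain distributed gradient descent with the same constant step-size, every agent here ends at the common minimizer 2.5.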

In the original paper [143] Algorithm 14 was proved to converge at an $O(1/k)$
rate for $L$-smooth objectives and to achieve a geometric rate $O(C^{-k})$ (where $C > 1$ is
some constant) for strongly convex smooth objectives. In [95] explicit dependencies
on graph topology were established. Namely, EXTRA requires
E LR? + M?/L 
O (2 + x) log ered iterations for strongly convex smooth objectives, 
KI é 


L x 
O (4 + x) log((L Ro+ Ww /L)x)) iterations for (non-strongly) convex smooth objectives, 
€ 


where

$$\chi = \frac{1}{1 - \lambda_2(M)}, \tag{13}$$

$\lambda_2(M)$ denotes the second largest eigenvalue of the mixing matrix $M$, $\|X^0 - X^*\|_F^2 \le
mR^2$, $\|X^*\|_F^2 \le mR^2$, $\|\nabla F(X^*)\|_F^2 \le mM^2$. The term $\chi$ characterizes graph
connectivity. A similar term, also referred to as the graph condition number, is used
in Sect. 3.3 for the graph Laplacian matrix. Graph condition numbers based on the mixing
matrix and on the Laplacian have the same meaning, as discussed in Sect. 2.1.


2.5 Accelerated Decentralized Algorithms 


The performance of decentralized gradient methods typically depends on the function (local
or global) condition number $\kappa$ and the graph condition number $\chi$ defined in (13).
For non-accelerated dynamics [143, 173] complexity bounds include $\kappa$ and $\chi$.
Improving the dependencies to $\sqrt{\kappa}$ and $\sqrt{\chi}$ is an important direction of research in
distributed optimization. This can be done by applying direct Nesterov acceleration
[116] or by employing meta-acceleration techniques such as Catalyst [99]. The two
major approaches studied in the literature are primal and dual algorithms.

Dual methods are based on a reformulation of problem (1) using a Laplacian
matrix induced by the communication network. This reformulation is discussed in
Sect. 3.3 in more detail. The basic idea behind the dual approach is to run first-order
methods on the problem dual to (2). Every gradient step on the dual is equivalent
to one communication round and one local gradient step taken by every node
in the network. In [135], algorithms using Chebyshev acceleration that achieve
$O(\sqrt{\kappa_l\chi}\log(1/\varepsilon))$ communication complexity are proposed. On the other hand, the
lower complexity bound for deterministic methods over strongly convex smooth
objectives is $\Omega(\sqrt{\kappa_g\chi}\log(1/\varepsilon))$, as shown in [135].

In the dual approach, one may run non-distributed accelerated schemes on the dual
problem and obtain accelerated complexity bounds, i.e., $\sqrt{\kappa\chi}$. For primal-only
methods this is not the case: primal algorithms have to alternate optimization
and consensus steps in a proper way and employ specific techniques such as gradient
tracking. A direct distributed scheme for the Nesterov accelerated method was proposed
in [126].


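Algorithm 15 Accelerated distributed Nesterov method
Require: Starting points $X^0 = Y^0 = V^0$, $S^0 = \nabla F(X^0)$, step-size $\eta > 0$, momentum term $\alpha$.
for $k = 0, 1, \ldots$ do
  $X^{k+1} = M Y^k - \eta S^k$
  $V^{k+1} = (1 - \alpha) M V^k + \alpha M Y^k - \frac{\eta}{\alpha} S^k$
  $Y^{k+1} = \frac{X^{k+1} + \alpha V^{k+1}}{1 + \alpha}$
  $S^{k+1} = M S^k + \nabla F(Y^{k+1}) - \nabla F(Y^k)$
end for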

In Algorithm 15 the quantity $S^{k+1}$ stands for a gradient estimator. The information
about the gradients held by different agents is diffused through the network via
consensus steps, i.e., multiplication $M S^k$. Every node stores one row of $S^k$, which
approximates the average gradient over the nodes in the network:

$$s_i^k \approx \frac{1}{m}\sum_{j=1}^{m}\nabla f_j(y_j^k).$$

This technique is referred to as gradient tracking and is employed in several primal
decentralized methods [3, 74, 112, 124-126, 170].
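The tracking property can be observed directly: because the mixing matrix is doubly stochastic, the network average of $S^k$ equals the average gradient at $Y^k$ at every iteration. The sketch below runs the updates of Algorithm 15 on a toy problem and records the largest violation of this identity. The setup is an assumption made for illustration: scalar agents on a six-node cycle, $f_i(x) = \frac{1}{2}(x - b_i)^2$, step-size $\eta = 0.01$, and momentum $\alpha = \sqrt{\mu\eta}$, a common choice for strongly convex problems that is not stated in the text above.

```python
import math

def cycle_mixing_matrix(m):
    """Metropolis weights on a cycle: each nonzero weight equals 1/3."""
    M = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in (i - 1, i, i + 1):
            M[i][j % m] = 1.0 / 3.0
    return M

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def accelerated_tracking(M, b, eta, mu, iters):
    """Updates of Algorithm 15 for f_i(x) = 0.5 * (x - b_i)^2, one scalar per agent."""
    m = len(b)
    alpha = math.sqrt(mu * eta)  # assumed momentum choice, not taken from the text
    grad = lambda z: [z[i] - b[i] for i in range(m)]
    x = [0.0] * m
    y, v = x[:], x[:]
    s = grad(x)
    max_gap = 0.0
    for _ in range(iters):
        My, Mv, Ms = matvec(M, y), matvec(M, v), matvec(M, s)
        x = [My[i] - eta * s[i] for i in range(m)]
        v = [(1 - alpha) * Mv[i] + alpha * My[i] - (eta / alpha) * s[i] for i in range(m)]
        y_new = [(x[i] + alpha * v[i]) / (1 + alpha) for i in range(m)]
        g_old, g_new = grad(y), grad(y_new)
        s = [Ms[i] + g_new[i] - g_old[i] for i in range(m)]
        y = y_new
        # Tracking identity: the average of S equals the average gradient at Y.
        max_gap = max(max_gap, abs(sum(s) / m - sum(g_new) / m))
    return y, max_gap

m = 6
M = cycle_mixing_matrix(m)
b = [float(i) for i in range(m)]  # hypothetical local data; the minimizer of (2) is mean(b) = 2.5
y, max_gap = accelerated_tracking(M, b, eta=0.01, mu=1.0, iters=50)
```

The identity holds up to rounding error regardless of the step-size; how well each individual row of $S^k$ approximates the average depends on how fast the consensus steps mix.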

Algorithm 15 requires $O(\kappa_\ell^{5/7}\chi^{3/2}\log(1/\varepsilon))$ computation and communication
steps to achieve accuracy $\varepsilon$, which does not match the optimal bounds. Acceleration
of EXTRA via the Catalyst envelope [95] requires $O(\sqrt{\kappa_\ell\chi}\log\chi\log(1/\varepsilon))$ iterations
for smooth strongly convex objectives. Recently, a new method, Mudag, which unifies
gradient tracking, Nesterov acceleration, and multi-step consensus procedures, was
proposed in [170]. It has

$$O\left(\sqrt{\kappa_g}\log\frac{1}{\varepsilon}\right)\ \text{computation complexity and}$$

$$O\left(\sqrt{\kappa_g\chi}\,\log\left(\frac{L_\ell}{L_g}\kappa_g\right)\log\frac{1}{\varepsilon}\right)\ \text{communication complexity.}$$

Mudag reaches optimal computation complexity and optimal communication
complexity up to the $\log\left(\frac{L_\ell}{L_g}\kappa_g\right)$ term. A valuable feature of the method is that it
depends on the global condition number $\kappa_g$ instead of the local $\kappa_\ell$. In the general
case, the global condition number may be significantly better. A proximal version of
the Mudag method for composite optimization is studied in [171]. The method in [171]
requires an optimal $O(\sqrt{\kappa_g}\log(1/\varepsilon))$ number of computations and matches the
lower communication complexity bound up to a logarithmic factor. The global condition
number is also utilized in paper [134], where an inexact oracle framework [28, 29]
for decentralized optimization is studied. The latter work is discussed in more detail
in Sect. 2.7. Finally, in [77] the authors proposed a primal-only method OPAPC
which reaches both optimal computation and communication complexities (up to
replacing $\kappa_g$ with $\kappa_\ell$).

Chebyshev acceleration is widely used to obtain optimal decentralized algorithms.
For example, in [77] the authors propose the method APAPC, which has
$O\left(\left(\sqrt{\frac{L}{\mu}\chi} + \chi\right)\log\frac{1}{\varepsilon}\right)$ communication and computational complexities. After
that, the authors replace the Laplacian $W$ with a Chebyshev polynomial $P_K(W)$,
which results in $\chi(P_K(W)) = O(1)$, but every communication round with $P_K(W)$ costs
$O(\sqrt{\chi})$ communication rounds with $W$. Therefore, APAPC is modified to a new method
OPAPC, which has $O\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\varepsilon}\right)$ oracle-per-node complexity and $O\left(\sqrt{\frac{L}{\mu}\chi}\log\frac{1}{\varepsilon}\right)$
communication complexity. In this particular case, Chebyshev acceleration not
only allows one to achieve optimal complexity bounds but also separates the oracle and
communication complexities of the algorithm.
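The effect of replacing $W$ with a Chebyshev polynomial can be checked numerically. The sketch below is a minimal illustration under stated assumptions: it uses the standard scaled-and-shifted Chebyshev construction $P_K(W) = I - T_K(c_2(I - c_3W))/T_K(c_2)$, which may differ in details from the polynomial used in [77], and a path-graph Laplacian chosen only because it is badly conditioned. The helper names `path_laplacian`, `condition_number`, and `chebyshev_gossip` are hypothetical.

```python
import math
import numpy as np


def path_laplacian(n):
    """Graph Laplacian of a path with n nodes (a poorly connected toy network)."""
    W = np.zeros((n, n))
    for i in range(n - 1):
        W[i, i] += 1.0
        W[i + 1, i + 1] += 1.0
        W[i, i + 1] -= 1.0
        W[i + 1, i] -= 1.0
    return W


def condition_number(W, tol=1e-8):
    """Ratio of the largest to the smallest positive eigenvalue of W."""
    eig = np.linalg.eigvalsh(W)
    pos = eig[eig > tol]
    return pos.max() / pos.min()


def chebyshev_gossip(W, K):
    """P_K(W) = I - T_K(c2 * (I - c3 * W)) / T_K(c2), built by the 3-term recurrence."""
    eig = np.linalg.eigvalsh(W)
    pos = eig[eig > 1e-8]
    gamma = pos.min() / pos.max()
    c2 = (1.0 + gamma) / (1.0 - gamma)
    c3 = 2.0 / ((1.0 + gamma) * pos.max())
    n = W.shape[0]
    X = c2 * (np.eye(n) - c3 * W)
    T_prev, T_cur = np.eye(n), X  # matrix Chebyshev polynomials T_0(X), T_1(X)
    t_prev, t_cur = 1.0, c2       # scalar values T_0(c2), T_1(c2)
    for _ in range(K - 1):
        T_prev, T_cur = T_cur, 2.0 * X @ T_cur - T_prev
        t_prev, t_cur = t_cur, 2.0 * c2 * t_cur - t_prev
    return np.eye(n) - T_cur / t_cur
```

With $K = \lceil\sqrt{\chi}\rceil$, one multiplication by $P_K(W)$ costs $K$ multiplications by $W$, which is the $O(\sqrt{\chi})$ price per round mentioned above.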

Paper [144] proposed OGT, a method based on a loopless Chebyshev acceleration
scheme. In contrast to classical Chebyshev acceleration (used, e.g., in
OPAPC [77]), the loopless technique does not require multiple communication
steps at each iteration. OGT requires $O\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\varepsilon}\right)$ oracle calls at each node and
$O\left(\sqrt{\frac{L}{\mu}\chi}\log\frac{1}{\varepsilon}\right)$ communication steps, which meets the lower bounds.

Moreover, a recent work [145] showed that Nesterov acceleration can also be
applied to distributed optimization over directed graphs. Their algorithm APD has
communication complexity $\sim 1/\sqrt{\varepsilon}$ for non-strongly convex objectives and APD-SC
has $\sim\sqrt{L/\mu}\log(1/\varepsilon)$ complexity for strongly convex tasks. As stated in [144],
in the case of undirected graphs the explicit dependence on network characteristics
is attained: the complexity of APD-SC writes as $O\left(\sqrt{\frac{L}{\mu}}\,\chi^{3/2}\log\frac{1}{\varepsilon}\right)$.
2.6 Time-Varying Networks


In the time-varying case, the communication network changes from time to time.
In practice these changes are typically caused by malfunctions such as loss of
connection between the agents. The network is represented as a sequence of
undirected communication graphs $\{\mathcal{G}^k = (V, E^k)\}_{k=0}^{\infty}$, and every graph $\mathcal{G}^k$ is
associated with a mixing matrix $M^k$. The algorithms capable of working over
time-varying graphs must be robust to sudden network changes. A linearly convergent
method DIGing was proposed in [112].


Algorithm 16 DIGing
Require: Step-size $\alpha > 0$, starting iterate $X^0$, $Y^0 = \nabla F(X^0)$
for $k = 0, 1, \ldots$ do
    $X^{k+1} = M^kX^k - \alpha Y^k$
    $Y^{k+1} = M^kY^k + \nabla F(X^{k+1}) - \nabla F(X^k)$
end for


DIGing incorporates a gradient-tracking scheme and achieves linear convergence
under realistic assumptions such as B-connectivity (i.e., the union of any B
consecutive graphs is connected). In [152] the authors propose an algorithm which utilizes
specific convex surrogates of local functions and the similarity of local functions in order
to enhance convergence speed. In [96] a gradient-tracking technique combined with
Nesterov acceleration was employed to construct an accelerated method AccGT
over time-varying B-connected networks.
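The DIGing updates are short enough to sketch directly. The following NumPy example is a minimal illustration under stated assumptions: the local objectives are toy quadratics $f_i(x) = \frac{1}{2}\|x - b_i\|_2^2$, the mixing matrices `M_RING`, `M_A`, `M_B` and the step-size are chosen for the demo, and `mixing` is a hypothetical callable returning $M^k$. The alternating pair `M_A`, `M_B` gives a 2-connected sequence in which each individual graph is disconnected.

```python
import numpy as np

# Hypothetical toy data: node i holds f_i(x) = 0.5 * ||x - b_i||^2 (rows of B).
B = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, -1.0], [-2.0, 4.0]])

# Static ring, and two pairwise-averaging graphs whose union is connected.
M_RING = np.array([[0.50, 0.25, 0.00, 0.25], [0.25, 0.50, 0.25, 0.00],
                   [0.00, 0.25, 0.50, 0.25], [0.25, 0.00, 0.25, 0.50]])
M_A = np.array([[0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.5], [0.0, 0.0, 0.5, 0.5]])
M_B = np.array([[0.5, 0.0, 0.0, 0.5], [0.0, 0.5, 0.5, 0.0],
                [0.0, 0.5, 0.5, 0.0], [0.5, 0.0, 0.0, 0.5]])


def grad(X):
    """Stacked local gradients: row i is the gradient of f_i at row x_i."""
    return X - B


def diging(grad, X0, mixing, alpha, iters):
    """DIGing iteration; `mixing(k)` returns the mixing matrix M^k."""
    X = X0.copy()
    Y = grad(X)  # Y^0 = grad F(X^0)
    for k in range(iters):
        M = mixing(k)
        X_new = M @ X - alpha * Y
        Y = M @ Y + grad(X_new) - grad(X)  # gradient tracking
        X = X_new
    return X, Y
```

On the static ring every row of $X^k$ approaches the minimizer of $\frac{1}{m}\sum_i f_i$, which for these quadratics is the mean of the $b_i$. For the alternating sequence the sketch only checks the tracking invariant, which holds for any doubly stochastic sequence.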

Another class of time-varying networks are graphs that stay connected at each
iteration. For this type of problems, denote the Laplacian at the $k$-th iteration by $W^k$ and define
the condition number $\chi = \frac{\max_k\lambda_{\max}(W^k)}{\min_k\lambda_{\min}^{+}(W^k)}$. The lower bounds for this class of problems
are $\Omega(\sqrt{\kappa_\ell}\log(1/\varepsilon))$ for the number of (local) computations and $\Omega(\chi\sqrt{\kappa_\ell}\log(1/\varepsilon))$
for the number of communications [76]. AccGT [96] and ADOM+ [76] are optimal
algorithms using a primal oracle, and ADOM [78] is an optimal dual method.


2.7 Inexact Oracle Point of View 


In [134] the authors study an algorithm which alternates between making gradient updates
and running multi-step communication procedures. Introduce

$$\overline{X} = \frac{1}{m}\mathbf{1}\mathbf{1}^\top X = \Pi_C(X) = (\bar{x}\ \ldots\ \bar{x})^\top,\quad\text{where } \bar{x} = \frac{1}{m}\sum_{i=1}^{m}x_i \text{ and } \mathbf{1} = (1, \ldots, 1)^\top.$$

Also define an average gradient over nodes $\overline{\nabla F}(X) = \frac{1}{m}\sum_{i=1}^{m}\nabla f_i(x_i)$. Consider
a projected gradient method with trajectory lying in $C$:

$$\overline{X}^{k+1} = \overline{X}^{k} - \beta\,\overline{\nabla F}(\overline{X}^{k}).$$


In a centralized scenario, the computational network is endowed with a master 
agent, which communicates with all agents in the network. The master node is able 
to collect vectors x; from every node in the network and compute a precise average 
x. In decentralized case the master agent is not available, and, therefore, nodes 
are only able to compute an approximate average using consensus procedures. The 
network is allowed to change with time and the sequence of corresponding mixing 
matrices is restricted to the following assumption. 


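Assumption 4 Mixing matrix sequence $\{M^k\}_{k=0}^{\infty}$ satisfies the following properties.

• (Decentralized property) $(i, j) \notin E^k \Rightarrow [M^k]_{ij} = 0$.

• (Double stochasticity) $M^k\mathbf{1} = \mathbf{1}$, $\mathbf{1}^\top M^k = \mathbf{1}^\top$.

• (Contraction property) There exist $\tau \in \mathbb{Z}_{++}$ and $\lambda \in (0, 1)$ such that for every
$k \ge \tau - 1$ it holds

$$\left\|M_\tau^k X - \overline{X}\right\|_2 \le (1 - \lambda)\left\|X - \overline{X}\right\|_2,$$

where $M_\tau^k = M^k \cdots M^{k-\tau+1}$.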


Algorithm 17 Consensus
Require: Initial $X^0 \in C$, number of iterations $T$.
for $t = 1, \ldots, T$ do
    $X^{t+1} = M^tX^t$
end for
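A small NumPy sketch can illustrate how the contraction property of Assumption 4 translates into a number of consensus iterations. It is a sketch under stated assumptions: the static ring matrix `M_RING` satisfies the contraction property with $\tau = 1$ and $\lambda = 1/2$ (its second-largest eigenvalue is $1/2$), and the iteration count takes the form $T = \frac{\tau}{2\lambda}\log\frac{D}{\delta'}$, where $D$ bounds the initial squared disagreement. The helper names are hypothetical.

```python
import math
import numpy as np

# Doubly stochastic ring matrix; eigenvalues are 1, 0.5, 0.5, 0 (tau = 1, lambda = 0.5).
M_RING = np.array([[0.50, 0.25, 0.00, 0.25], [0.25, 0.50, 0.25, 0.00],
                   [0.00, 0.25, 0.50, 0.25], [0.25, 0.00, 0.25, 0.50]])


def consensus(X0, mixing, T):
    """Algorithm 17: T multiplications by the mixing matrices returned by `mixing(t)`."""
    X = X0.copy()
    for t in range(T):
        X = mixing(t) @ X
    return X


def disagreement(X):
    """Squared distance ||X - X_bar||_2^2 from X to the consensus set."""
    return np.linalg.norm(X - X.mean(axis=0, keepdims=True)) ** 2


def consensus_iterations(D, delta_prime, tau, lam):
    """Iterations sufficient to shrink squared disagreement from D to delta_prime."""
    return max(0, math.ceil(tau / (2.0 * lam) * math.log(D / delta_prime)))
```

The count follows from $(1-\lambda)^{2T/\tau}D \le e^{-2\lambda T/\tau}D \le \delta'$. Double stochasticity keeps the average of the rows unchanged, so only the disagreement shrinks.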


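Algorithm 18 Decentralized AGD with consensus subroutine
Require: Initial guess $X^0 \in C$, constants $L, \mu > 0$, $U^0 = X^0$, $\alpha^0 = A^0 = 0$
1: for $k = 0, 1, 2, \ldots$ do
2:     Find $\alpha^{k+1}$ as the greater root of $(A^k + \alpha^{k+1})(1 + A^k\mu) = L(\alpha^{k+1})^2$
3:     $A^{k+1} = A^k + \alpha^{k+1}$
4:     $Y^{k+1} = \frac{\alpha^{k+1}U^k + A^kX^k}{A^{k+1}}$
5:     $V^{k+1} = \frac{\alpha^{k+1}\mu Y^{k+1} + (1 + A^k\mu)U^k}{1 + A^k\mu + \alpha^{k+1}\mu} - \frac{\alpha^{k+1}}{1 + A^k\mu + \alpha^{k+1}\mu}\nabla F(Y^{k+1})$
6:     $U^{k+1} = \text{Consensus}(V^{k+1}, T^k)$
7:     $X^{k+1} = \frac{\alpha^{k+1}U^{k+1} + A^kX^k}{A^{k+1}}$
8: end for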
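The step-size rule in line 2 of Algorithm 18 is a scalar quadratic equation, so the coefficients can be generated independently of the network. The sketch below assumes the equation reads $(A^k + \alpha^{k+1})(1 + A^k\mu) = L(\alpha^{k+1})^2$ and solves it in closed form. The function names `next_alpha` and `step_sequence` are hypothetical.

```python
import math


def next_alpha(A, L, mu):
    """Greater root a of (A + a) * (1 + A * mu) = L * a**2."""
    c = 1.0 + A * mu
    # Rearranged: L*a^2 - c*a - A*c = 0, solved with the quadratic formula.
    return (c + math.sqrt(c * c + 4.0 * L * A * c)) / (2.0 * L)


def step_sequence(L, mu, iters):
    """List of pairs (alpha^{k+1}, A^{k+1}) starting from A^0 = 0."""
    A, seq = 0.0, []
    for _ in range(iters):
        a = next_alpha(A, L, mu)
        A += a
        seq.append((a, A))
    return seq
```

Starting from $A^0 = 0$, the first coefficient is $\alpha^1 = 1/L$, i.e., the first iteration is a plain gradient step with step-size $1/L$ followed by consensus.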
2.7.1 Inexact Oracle Construction


The trajectory of Algorithm 18 lies in a neighborhood of the constraint set $C$. It is
analyzed in [134] (based on the technique developed in [132]) using the notion
of inexact oracle [28, 29]. Algorithms of this type have been analyzed in the
time-static case [63] using the inexact oracle notion as well. Let $h(x)$ be a convex
function defined on a convex set $Q \subseteq \mathbb{R}^n$. For $\delta > 0$, $L > \mu > 0$, a pair
$(h_{\delta,L,\mu}(x), s_{\delta,L,\mu}(x))$ is called a $(\delta, L, \mu)$-model of $h(x)$ at point $x \in Q$ if for
all $y \in Q$ it holds

$$\frac{\mu}{2}\|y - x\|_2^2 \le h(y) - \left(h_{\delta,L,\mu}(x) + \langle s_{\delta,L,\mu}(x), y - x\rangle\right) \le \frac{L}{2}\|y - x\|_2^2 + \delta.\qquad(14)$$
The inexactness originates from computation of gradient at a point in neighborhood 


of C. The next lemma identifies the size of neighborhood and describes the inexact 
oracle construction. 
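The two-sided inequality (14) can be checked numerically on sample points. The sketch below is a minimal illustration with hypothetical helper names `model_gap` and `is_model`. It is meant to be used with a quadratic $h$, for which the exact value and gradient satisfy (14) with $\delta = 0$ (the exact-oracle limit), and with a gradient evaluated at a shifted point, which breaks the lower bound unless the model value is corrected as well. This is why the construction in the next lemma modifies the function value and not only the gradient.

```python
import numpy as np


def model_gap(h, h_model, s_model, x, y):
    """h(y) minus the linear model built from (h_model(x), s_model(x))."""
    return h(y) - (h_model(x) + s_model(x) @ (y - x))


def is_model(h, h_model, s_model, x, ys, delta, L, mu, tol=1e-9):
    """Check inequality (14) at the sample points ys (a necessary check only)."""
    for y in ys:
        r2 = float(np.sum((y - x) ** 2))
        gap = model_gap(h, h_model, s_model, x, y)
        if gap < 0.5 * mu * r2 - tol or gap > 0.5 * L * r2 + delta + tol:
            return False
    return True
```

Because the check runs over finitely many points, a `True` result does not prove the model property; a `False` result does disprove it.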


Lemma 2 Define 


N 


7 1 as 1 2G Wie: 2 
fr,.tuO, X) =~ F(X) +(VF(X), X—X)+=[ mw |x —x|5]. 


$$g_{\delta,L,\mu}(\bar{x}, X) = \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x_i).\qquad(15)$$


Then $(f_{\delta,L,\mu}(\bar{x}, X), g_{\delta,L,\mu}(\bar{x}, X))$ is a $(\delta, 2L_g, \mu_g/2)$-model of $f$ at point $\bar{x}$, i.e.,

$$\frac{\mu_g}{4}\|y - \bar{x}\|_2^2 \le f(y) - f_{\delta,L,\mu}(\bar{x}, X) - \langle g_{\delta,L,\mu}(\bar{x}, X), y - \bar{x}\rangle \le L_g\|y - \bar{x}\|_2^2 + \delta.$$

The inexact oracle defined in Lemma 2 represents a $(\delta, 2L_g, \mu_g/2)$-model of $F$.
Note that it uses global strong convexity constants instead of local ones. Global
constants may be significantly better for method performance, as pointed out in
[135]. The lemma relates the projection accuracy $\delta'$ to the inexact oracle parameter $\delta$.


2.7.2 Convergence Result for Algorithm 18 
First, Lemma 2 states that a $(\delta, 2L_g, \mu_g/2)$-model of $F$ is obtained if the gradient is
computed in a $\delta'$-neighborhood of $C$. In order to achieve this $\delta'$-neighborhood, one
needs to make a sufficient number of consensus (Algorithm 17) iterations.

Lemma 3 Let consensus accuracy be maintained at level $\delta'$, i.e.,
$\|U^j - \overline{U}^j\|_2^2 \le \delta'$ for $j = 1, \ldots, k$, and let Assumption 4 hold. Define


+2 
vf 
- 


21) Li - 2 gs \2 QIVF(X*)Il 
vD = ( +1) Vi4 val Te a +7) a Bea 
VL pL 2 JS~Ly VL 


Then it is sufficient to make $T_k = T = \frac{\tau}{2\lambda}\log\frac{D}{\delta'}$ consensus iterations (where $\tau$
and $\lambda$ are defined in Assumption 4) in order to obtain consensus with $\delta'$-accuracy
on step $k + 1$, i.e., $\|U^{k+1} - \overline{U}^{k+1}\|_2^2 \le \delta'$.


A basis for the proof of Lemma 3 is the contraction property of the mixing matrix
sequence $\{M^k\}_{k=0}^{\infty}$ (see Assumption 4).
Second, provided that the projection accuracy on every step of Algorithm 18 is
sustained at level $\delta'$, the algorithm turns into an accelerated scheme with inexactness.
Its convergence rate is given by the following lemma.


Lemma 4 Provided that consensus accuracy is $\delta'$, i.e., $\|U^j - \overline{U}^j\|_2^2 \le \delta'$ for $j = 1, \ldots, k$, we have


Lees HED 


f@)-f@%)< 


2Ak Ak 
2 k j 
jor arf < Mae , ha 
27— 1+Aku 1+ Aku 


where $\delta$ is given in (15).


Finally, putting Lemmas 3 and 4 together yields a convergence result for 
Algorithm 18. 


Theorem 5 Recall the definitions of t and x from Assumption 2, choose some € > 0 
and set 


3/2 

D 
k=T= . log ae ges at ; 
2n r) 32 py! Le 


Also define 


ie —x* 


L 
D, = = [sau 
&§ 


(24 AVII|VF(X*)Il> (#)" 
2\eg Jn Meg 


L, L,\'/4 


Mg 
Then Algorithm 18 requires


2 
_ 4 Ls |“ — x* |, 
N=2 [ 3 log a (17) 


gradient computations at each node and 


= 2 
Lgt 2L¢ | — x*|, Di 
Nrot = N-T =2,|—— - log log {| —=+ D2 (18) 
Hg i é Je 


communication steps to yield $X^N$ such that

$$f(\bar{x}^N) - f(x^*) \le \varepsilon,\qquad \left\|X^N - \overline{X}^N\right\|_2^2 \le \delta'.$$

In the time-static case, the contraction term $\tau/\lambda$ turns into $\chi(M)$, and an
accelerated consensus procedure of type (7) may be employed. This results in
a better dependence on graph connectivity and leads to a complexity bound
$O\left(\sqrt{\frac{L_g}{\mu_g}\chi(M)}\log^2\left(\frac{1}{\varepsilon}\right)\right)$, which is optimal up to a logarithmic term. Similar
results are attained in works which use penalty-based methods [51, 94, 133] (see
Appendix B in [51] for details).


Remark The analysis of Algorithm 18 presented in [131] results in constants $L_g$, $\mu_g$
in the complexity bound. These constants are better than the local constants $L_\ell$, $\mu_\ell$ but
still can be improved. The inexact oracle concept allows one to reduce the decentralized
optimization problem to minimization of $f(x)$ over $\mathbb{R}^d$ with an inexact oracle.
Therefore, the complexity will depend on the constants $L_f$, $\mu_f$, which characterize $f$
itself, not its flattened variant $F$. An accurate analysis of this issue is presented in
Sect. 2.8.


2.7.3 Stochastic Decentralized Optimization 


The technique used in Algorithm 18 can be extended to stochastic objectives.
Following the definitions in [131], let $f_i(x) := \mathbb{E}_{\xi_i}f_i(x, \xi_i)$, where the $\xi_i$'s are random
variables. The variables $\xi_i$ represent the source of stochasticity in $f_i(x, \xi_i)$, which may be
caused by random sampling or stochastic noise. For each $i = 1, \ldots, m$ we assume
that $\nabla f_i(x, \xi_i)$ is $L_i(\xi_i)$-Lipschitz continuous and there exists a constant $L_i > 0$ such that
$\sqrt{\mathbb{E}_{\xi_i}L_i(\xi_i)^2} \le L_i < +\infty$. Under these assumptions $f_i$ is $L_i$-smooth. We also
bound the variance of $\nabla f_i(x, \xi_i)$:

$$\mathbb{E}_{\xi_i}\left[\|\nabla f_i(x, \xi_i) - \nabla f_i(x)\|_2^2\right] \le \sigma_i^2.$$


Recent Theoretical Advances in Decentralized Distributed Convex Optimization 275 


Let us define $\sigma_g^2 = \frac{1}{m}\sum_{i=1}^{m}\sigma_i^2$. The algorithm in [131] combines a consensus
subroutine technique similar to Algorithm 18 and also uses a specific batch-size
policy. In order to analyze the method, an inexact oracle framework similar to that of
Sect. 2.7 is used. The inexactness of the gradient has two sources: inexact projection
onto the constraint set via the consensus subroutine and stochastic noise. On the
one hand, tuning the batch size allows one to reduce the variance of the batched
gradient at the cost of additional stochastic oracle calls. Therefore, a proper batch
size guarantees a balance between the stochastic gradient noise and the number of
gradient calculations. On the other hand, the accuracy of the consensus is tuned
by the choice of the number of consensus iterations. Choosing a proper batch size and
number of consensus iterations allows one to obtain optimal complexities both in the
number of computations and communications up to a logarithmic factor. Namely,
2 


the method in [131] requires O (max “2 at log i |) oracle calls per node. In 
& 


N[Lgé’ 


the time-varying case, it requires $\tilde{O}\left(\frac{\tau}{\lambda}\sqrt{\frac{L_g}{\mu_g}}\right)$ communication rounds (where $\tau$ and $\lambda$
are defined in Assumption 4), and in the time-static case its communication complexity
takes the form $\tilde{O}\left(\sqrt{\frac{L_g}{\mu_g}\chi}\right)$ and is achieved by using Chebyshev acceleration.


2.8 Decentralized Saddle-Point Problems 


Along with minimization problems, sum-type min-max problems of type

$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}} f(x, y) := \frac{1}{m}\sum_{i=1}^{m}f_i(x, y),$$

where $\mathcal{X}$ and $\mathcal{Y}$ are convex compacts, can be solved in a decentralized manner as
well. In the same way as in Assumptions 2, 3 for minimization tasks, we introduce
assumptions for min-max problems.


Assumption 6 For every $i = 1, \ldots, m$, function $f_i$ is differentiable, convex in $x$,
concave in $y$ and $L_i$-smooth.


Assumption 7 Function $f$ is $\mu_f$-strongly convex in $x$, $\mu_f$-strongly concave in $y$
($\mu_f > 0$) and $L_f$-smooth.


Saddle-point problems have many practical applications: classical and well-studied
ones in economics and game theory [40, 162], and modern ones in image denoising
[24], adversarial training [9, 50], and statistical learning [1]. However, distributed
saddle-point problems are not as widely studied in the literature as minimization
problems. Let us highlight the main works devoted to decentralized min-max
problems. Most of the works are devoted to decentralized algorithms on a fixed
graph topology. In paper [20], the authors present lower bounds for deterministic
decentralized saddle-point problems under Assumptions 6 and 7. These estimates
are as follows:

$$\Omega\left(\frac{L_f}{\mu_f}\log\frac{1}{\varepsilon}\right)\ \text{computation complexity and}$$

$$\Omega\left(\sqrt{\chi}\,\frac{L_f}{\mu_f}\log\frac{1}{\varepsilon}\right)\ \text{communication complexity.}\qquad(19)$$


Additionally, the paper provides an optimal algorithm (up to logarithmic factors),
which achieves the lower bounds. Among the disadvantages of the algorithm
presented in [20], one can single out multiple gossip steps; this approach is unstable
and not popular in practice. A similar algorithm with multiple gossip steps is
proposed in [103], but the authors consider convergence in the non-convex case (under the
Minty condition [66, 108]). This work also shows the effectiveness of decentralized
training of GANs. The work [105] is also devoted to Minty non-convex saddle-point problems.
It is also interesting to note the work [20] on saddle-point problems in terms of data
similarity. The work gives lower bounds for communication complexity, as well as
optimal algorithms for this setting of the problem. In particular, the lower and upper
bounds state that

$$\sqrt{\chi}\left(1 + \frac{\delta}{\mu_f}\right)\log\frac{1}{\varepsilon}\ \text{communication rounds}$$

are enough to achieve $\varepsilon$-precision. It is interesting to note that for uniformly distributed
data, with high probability, $L_i \sim L_f$ and $\delta \sim O(\max_i L_i/\sqrt{n})$, where $n$ is the
number of local samples on each node. This means that the data-similarity bounds
on communication rounds are significantly better than the general one (19).

It is also important to mention the works devoted to saddle-point problems on 
time-varying networks. In particular, paper [18] is devoted to lower bounds and 
optimal algorithms for connected topology. Work [14] is devoted to the broader 
case of time-varying networks, for example, methods can do local steps (iterations 
without communication). 

An interesting variant of saddle-point problems is extensions to local and global 
variables, i.e., problems of the form 


m 


m 
fin nee SAG Ped: (20) 
pilxi¥iiy rtyibihy mM 
Such problem formulations cover minimization tasks with separable and
semi-definite constraints [107], decentralized reinforcement learning [163] and
distributed computation of Wasserstein barycenters [36, 130]. A subgradient method
for problems of type (20) was proposed in [107]. The method has an $O(1/\sqrt{N})$
convergence rate. A recent work [130] proposed a method based on Mirror-Prox,
capable of working in a general proximal setup, reaching an $O(1/N)$ convergence
rate and an accelerated rate in $\chi$. The method achieves optimal oracle and
communication complexities in the Euclidean convex-concave case over time-static
graphs.

The state-of-the-art results for stochastic decentralized strongly convex-concave
saddle-point problems (with variance reduction, for time-varying networks) are
presented in the very recent paper [79].


3 Convex Problems with Affine Constraints 


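In this section$^1$, we consider the convex optimization problem with affine constraints

$$\min_{Ax=0,\ x\in Q} f(x),\qquad(21)$$

where $A \succeq 0$, $\mathrm{Ker}A \ne \{0\}$ and $Q$ is a closed convex subset of $\mathbb{R}^n$. Up to a sign, the
dual problem is defined as follows:

$$\min_{y}\psi(y),\quad\text{where}\qquad(22)$$

$$\varphi(y) = \max_{x\in Q}\{\langle y, x\rangle - f(x)\},\qquad(23)$$

$$\psi(y) = \varphi(A^\top y) = \max_{x\in Q}\{\langle y, Ax\rangle - f(x)\} = \langle A^\top y, x(A^\top y)\rangle - f(x(A^\top y)),\qquad(24)$$

where $x(y) \stackrel{\text{def}}{=} \arg\max_{x\in Q}\{\langle y, x\rangle - f(x)\}$. Since $\mathrm{Ker}A \ne \{0\}$, the solution of the
dual problem (22) is not unique. We use $y^*$ to denote the solution of (22) with the
smallest $\ell_2$-norm $R_y \stackrel{\text{def}}{=} \|y^*\|_2$.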

3.1 Primal Approach 


In this section, we focus on primal approaches to solve (21) and, in particular, the
main goal of this section is to present first-order methods that are optimal both in
terms of $\nabla f(x)$ and $A^\top Ax$ calculations. One can apply the following trick [32, 44,
51] to solve problem (21): instead of (21) one can solve the penalized problem

$$\min_{x\in Q}\left\{F(x) = f(x) + \frac{R_y^2}{\varepsilon}\|Ax\|_2^2\right\},\qquad(25)$$

1 The narrative in this section follows [51].
where $\varepsilon > 0$ is the desired accuracy of the solution in terms of $f(x)$ that we want
to achieve (see the details in [51]).
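The penalty trick can be illustrated on a toy instance where both the constrained and the penalized problems have closed-form solutions. The sketch below is a minimal illustration under stated assumptions: $f(x) = \frac{1}{2}\|x - c\|_2^2$, $Q = \mathbb{R}^n$, and $A$ is the Laplacian of a 3-node path, so that $Ax = 0$ encodes consensus. The penalty term is taken as $\frac{R_y^2}{\varepsilon}\|Ax\|_2^2$, and the helper names are hypothetical.

```python
import numpy as np


def constrained_solution(A, c):
    """Minimizer of 0.5*||x - c||^2 subject to A x = 0 (projection of c onto Ker A)."""
    return c - np.linalg.pinv(A) @ (A @ c)


def dual_radius(A, c):
    """R_y = ||y*||_2 for the minimum-norm dual solution y* = -(A A^T)^+ A c."""
    return np.linalg.norm(np.linalg.pinv(A @ A.T) @ (A @ c))


def penalized_solution(A, c, R_y, eps):
    """Minimizer of 0.5*||x - c||^2 + (R_y^2 / eps) * ||A x||^2 over R^n."""
    n = A.shape[1]
    # First-order condition: (I + (2 R_y^2 / eps) A^T A) x = c.
    return np.linalg.solve(np.eye(n) + (2.0 * R_y ** 2 / eps) * A.T @ A, c)
```

For this instance the exact penalized minimizer is $\varepsilon$-accurate in function value and violates the constraint by at most $\varepsilon/R_y$, which follows from $f(x) \ge f(x^*) - R_y\|Ax\|_2$ combined with $F(x) \le F(x^*) = f(x^*)$.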

Next, we assume that $f$ is a $\mu$-strongly convex but possibly non-smooth function
with bounded (sub)gradients: $\|\nabla f(x)\|_2 \le M$ for all $x \in Q$. In this setting, one can
apply the Sliding algorithm from [85, 86] to get optimal rates of convergence. The
method is presented as Algorithm 19 and it is aimed at solving the following problem:

$$\min_{x\in Q}\{\Psi(x) = h(x) + f(x)\},\qquad(26)$$

where $h(x)$ is convex and $L$-smooth, $f(x)$ is convex but can be non-smooth, and
$x^*$ is an arbitrary solution of the problem. In this case, it is additionally assumed
that $f(x)$ has uniformly bounded subgradients: there exists a non-negative constant
$M$ such that$^2$ $\|\nabla f(x)\|_2 \le M$ for all $x \in Q$ and all subgradients
$\nabla f(x) \in \partial f(x)$ at this point. The key property of Algorithm 19 is its ability to separate the oracle
complexities for the smooth and non-smooth parts of the objective. That is, to find
$x$ such that $\Psi(x) - \Psi(x^*) \le \varepsilon$, Sliding requires

$$O\left(\sqrt{\frac{LR^2}{\varepsilon}}\right)\ \text{calculations of } \nabla h(x)$$

and

$$O\left(\frac{M^2R^2}{\varepsilon^2} + \sqrt{\frac{LR^2}{\varepsilon}}\right)\ \text{calculations of } \nabla f(x),$$

where $R = \|x^0 - x^*\|_2$.
Now, we go back to problem (25) and consider the case when $\mu = 0$. In this
setting, to find $x$ such that

$$F(x) - F(x^*) \le \varepsilon\qquad(27)$$

one can run Algorithm 19 considering $f(x)$ as the non-smooth term and $\frac{R_y^2}{\varepsilon}\|Ax\|_2^2$
as the smooth one. In this case, Sliding requires

$$O\left(\sqrt{\frac{\lambda_{\max}(A^\top A)R_y^2R^2}{\varepsilon^2}}\right)\ \text{calculations of } A^\top Ax,\qquad(28)$$

2 For the sake of simplicity, we slightly abuse the notation and denote gradients and subgradients
similarly.
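Algorithm 19 Sliding algorithm [85, 86]

Input: Initial point $x_0 \in Q$ and iteration limit $N$.
Let $\beta_k \in \mathbb{R}_{++}$, $\gamma_k \in \mathbb{R}_{+}$, and $T_k \in \mathbb{N}$, $k = 1, 2, \ldots$, be given and set $\bar{x}_0 = x_0$.
for $k = 1, 2, \ldots, N$ do
    1. Set $\underline{x}_k = (1 - \gamma_k)\bar{x}_{k-1} + \gamma_kx_{k-1}$, and let $h_k(\cdot) = l_h(\underline{x}_k, \cdot)$, where $l_h(x, y) = h(x) + \langle\nabla h(x), y - x\rangle$.
    2. Set $(x_k, \tilde{x}_k) = \mathrm{PS}(h_k, x_{k-1}, \beta_k, T_k)$.
    3. Set $\bar{x}_k = (1 - \gamma_k)\bar{x}_{k-1} + \gamma_k\tilde{x}_k$.
end for
Output: $\bar{x}_N$.

The PS (prox-sliding) procedure.

procedure: $(x^+, \tilde{x}^+) = \mathrm{PS}(g, x, \beta, T)$
Let the parameters $p_t \in \mathbb{R}_{++}$ and $\theta_t \in [0, 1]$, $t = 1, \ldots$, be given. Set $u_0 = \tilde{u}_0 = x$.
for $t = 1, 2, \ldots, T$ do
    $u_t = \arg\min_{u\in Q}\left\{g(u) + l_f(u_{t-1}, u) + \frac{\beta}{2}\|u - x\|_2^2 + \frac{\beta p_t}{2}\|u - u_{t-1}\|_2^2\right\}$,
    $\tilde{u}_t = (1 - \theta_t)\tilde{u}_{t-1} + \theta_tu_t$,
    where $l_f(x, y) = f(x) + \langle\nabla f(x), y - x\rangle$.
end for
Set $x^+ = u_T$ and $\tilde{x}^+ = \tilde{u}_T$.
end procedure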

and

$$O\left(\frac{M^2R^2}{\varepsilon^2}\right)\ \text{calculations of } \nabla f(x).\qquad(29)$$


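Next, we consider the situation when $Q$ is a compact set, $\nabla f(x)$ is not available,
and a stochastic gradient $\nabla f(x, \xi)$ is used instead:

$$\left\|\mathbb{E}_\xi[\nabla f(x, \xi)] - \nabla f(x)\right\|_2 \le \delta,\qquad(30)$$

$$\mathbb{E}_\xi\left[\exp\left(\frac{\left\|\nabla f(x, \xi) - \mathbb{E}_\xi[\nabla f(x, \xi)]\right\|_2^2}{\sigma^2}\right)\right] \le \exp(1),\qquad(31)$$

where $\delta \ge 0$ and $\sigma \ge 0$. When $\delta = 0$, i.e., stochastic gradients are unbiased, one can
show [85, 86] that the Stochastic Sliding (S-Sliding) method can achieve (27)
with probability at least $1 - \beta$, $\beta \in (0, 1)$, requiring the same number of calculations
of $A^\top Ax$ as in (28) up to logarithmic factors and

$$\tilde{O}\left(\frac{(M^2 + \sigma^2)R^2}{\varepsilon^2}\right)\ \text{calculations of } \nabla f(x, \xi).\qquad(32)$$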
€ 
When $\mu > 0$ one can apply the restarts technique for S-Sliding and get the
method (RS-Sliding) [32, 159] that guarantees (27) with probability at least
$1 - \beta$, $\beta \in (0, 1)$, using

$$\tilde{O}\left(\sqrt{\frac{\lambda_{\max}(A^\top A)R_y^2}{\mu\varepsilon}}\right)\ \text{calculations of } A^\top Ax,\qquad(33)$$

$$\tilde{O}\left(\frac{M^2 + \sigma^2}{\mu\varepsilon}\right)\ \text{calculations of } \nabla f(x, \xi).\qquad(34)$$


We notice that the bounds presented above for the non-smooth case are proved when
$Q$ is bounded. For the case of unbounded $Q$ the convergence results with such rates
were established only in expectation. Moreover, it would be interesting to study
S-Sliding and RS-Sliding in the case when $\delta > 0$, i.e., the stochastic gradient is
biased.


3.2 Dual Approach


In this section, we assume that one can construct a dual problem for (21). If $f$
is $\mu$-strongly convex in the $\ell_2$-norm, then $\psi$ and $\varphi$ have $L_\psi$-Lipschitz continuous and
$L_\varphi$-Lipschitz continuous gradients in the $\ell_2$-norm, respectively [69, 129], where
$L_\psi = \lambda_{\max}(A^\top A)/\mu$ and $L_\varphi = 1/\mu$. In our proofs, we often use the Demyanov-Danskin theorem
[129], which states that

$$\nabla\psi(y) = Ax(A^\top y),\qquad \nabla\varphi(y) = x(y).\qquad(35)$$

Moreover, we do not assume that $A$ is symmetric or positive semi-definite.
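The gradient formula (35) and the Lipschitz constant $L_\psi$ can be verified numerically on a simple instance. The sketch below assumes $f(x) = \frac{\mu}{2}\|x - c\|_2^2$ with $Q = \mathbb{R}^n$, for which $x(y) = c + y/\mu$ in closed form. The constant `MU`, the helper names, and the finite-difference check are illustrative choices, not part of [51].

```python
import numpy as np

MU = 2.0  # strong convexity parameter of the toy f (an assumption for the demo)


def x_of(y, c):
    """x(y) = argmax_x { <y, x> - f(x) } for f(x) = MU/2 * ||x - c||^2."""
    return c + y / MU


def psi(y, A, c):
    """Dual function psi(y) = <A^T y, x(A^T y)> - f(x(A^T y)), as in (24)."""
    x = x_of(A.T @ y, c)
    return (A.T @ y) @ x - 0.5 * MU * np.sum((x - c) ** 2)


def grad_psi(y, A, c):
    """Demyanov-Danskin formula (35): grad psi(y) = A x(A^T y)."""
    return A @ x_of(A.T @ y, c)


def numerical_grad(fun, y, h=1e-6):
    """Central finite differences, used only to cross-check grad_psi."""
    g = np.zeros_like(y)
    for i in range(y.size):
        e = np.zeros_like(y)
        e[i] = h
        g[i] = (fun(y + e) - fun(y - e)) / (2.0 * h)
    return g
```

For this $f$ the dual function is quadratic, $\psi(y) = \langle A^\top y, c\rangle + \frac{1}{2\mu}\|A^\top y\|_2^2$, so its gradient is Lipschitz with constant $\lambda_{\max}(A^\top A)/\mu$ even for a non-symmetric $A$.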

Below we propose a primal-dual method for the case when f is additionally 
Lipschitz continuous on some ball and two methods for the problems when the 
primal function is also L-smooth and Lipschitz continuous on some ball. In the 
subsections below, we assume that Q = R”. The formal proofs of the presented 
results are given in [51]. 


3.2.1 Convex Dual Function 


In this section, we assume that the dual function $\varphi(y)$ can be rewritten as an
expectation, i.e., $\varphi(y) = \mathbb{E}_\xi[\varphi(y, \xi)]$, where the stochastic realizations $\varphi(y, \xi)$ are
differentiable in $y$ almost surely in $\xi$. Then, we can also represent $\psi(y)$
as an expectation: $\psi(y) = \mathbb{E}_\xi[\psi(y, \xi)]$. Consider the stochastic function $f(x, \xi)$
which is defined implicitly as follows:

$$\varphi(y, \xi) = \max_{x\in\mathbb{R}^n}\{\langle y, x\rangle - f(x, \xi)\}.\qquad(36)$$

Similarly to the deterministic case, we introduce $x(y, \xi) \stackrel{\text{def}}{=} \arg\max_{x\in\mathbb{R}^n}\{\langle y, x\rangle - f(x, \xi)\}$,
which satisfies $\nabla\varphi(y, \xi) = x(y, \xi)$ due to the Demyanov-Danskin
theorem, where the gradient is taken w.r.t. $y$. As a simple corollary, we get
$\nabla\psi(y, \xi) = Ax(A^\top y, \xi)$. Finally, the introduced notation and obtained relations imply
that $x(y) = \mathbb{E}_\xi[x(y, \xi)]$ and $\nabla\psi(y) = \mathbb{E}_\xi[\nabla\psi(y, \xi)]$.

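Consider the situation when $x(y, \xi)$ is known only through the noisy observations
$\tilde{x}(y, \xi) = x(y, \xi) + \delta(y, \xi)$ and assume that the noise is bounded in
expectation, i.e., there exists a non-negative deterministic constant $\delta_y \ge 0$ such that

$$\left\|\mathbb{E}_\xi[\delta(y, \xi)]\right\|_2 \le \delta_y,\quad \forall y \in \mathbb{R}^n.\qquad(37)$$

Assume additionally that $\tilde{x}(y, \xi)$ satisfies the so-called light-tails inequality:

$$\mathbb{E}_\xi\left[\exp\left(\frac{\left\|\tilde{x}(y, \xi) - \mathbb{E}_\xi[\tilde{x}(y, \xi)]\right\|_2^2}{\sigma_x^2}\right)\right] \le \exp(1),\quad \forall y \in \mathbb{R}^n,\qquad(38)$$

where $\sigma_x$ is some positive constant. It implies that we have access to the biased
stochastic gradient $\tilde{\nabla}\psi(y, \xi) \stackrel{\text{def}}{=} A\tilde{x}(y, \xi)$, which satisfies the following relations:

$$\left\|\mathbb{E}_\xi\left[\tilde{\nabla}\psi(y, \xi)\right] - \nabla\psi(y)\right\|_2 \le \delta,\quad \forall y \in \mathbb{R}^n,\qquad(39)$$

$$\mathbb{E}_\xi\left[\exp\left(\frac{\left\|\tilde{\nabla}\psi(y, \xi) - \mathbb{E}_\xi\left[\tilde{\nabla}\psi(y, \xi)\right]\right\|_2^2}{\sigma_\psi^2}\right)\right] \le \exp(1),\quad \forall y \in \mathbb{R}^n,\qquad(40)$$

where $\delta \stackrel{\text{def}}{=} \sqrt{\lambda_{\max}(A^\top A)}\,\delta_y$ and $\sigma_\psi \stackrel{\text{def}}{=} \sqrt{\lambda_{\max}(A^\top A)}\,\sigma_x$. We will use $\tilde{\nabla}\Psi(y, \xi^k)$ to
denote the batched stochastic gradient:

$$\tilde{\nabla}\Psi(y, \xi^k) = \frac{1}{r_k}\sum_{l=1}^{r_k}\tilde{\nabla}\psi(y, \xi^l),\qquad \tilde{x}(y, \xi^k) = \frac{1}{r_k}\sum_{l=1}^{r_k}\tilde{x}(y, \xi^l).\qquad(41)$$

The size of the batch $r_k$ can always be restored from the context, so we do not
specify it here. Note that the batch version satisfies (see the details in [51])

$$\left\|\mathbb{E}\left[\tilde{\nabla}\Psi(y, \xi^k)\right] - \nabla\psi(y)\right\|_2 \le \delta,\quad \forall y \in \mathbb{R}^n,\qquad(42)$$

$$\mathbb{E}\left[\exp\left(\frac{\left\|\tilde{\nabla}\Psi(y, \xi^k) - \mathbb{E}\left[\tilde{\nabla}\Psi(y, \xi^k)\right]\right\|_2^2}{O(\sigma_\psi^2/r_k)}\right)\right] \le \exp(1),\quad \forall y \in \mathbb{R}^n.\qquad(43)$$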
In this setting, we consider a method called SPDSTM (Stochastic Primal-Dual
Similar Triangles Method, see Algorithm 20). Note that Algorithm 4 from [34] is a
special case of SPDSTM when $\delta = 0$, i.e., the stochastic gradient is unbiased, up to a
factor 2 in the choice of $\tilde{L}$.


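Algorithm 20 SPDSTM
Require: $\tilde{y}^0 = z^0 = y^0 = 0$, number of iterations $N$, $\alpha_0 = A_0 = 0$
1: for $k = 0, \ldots, N$ do
2:     Set $\tilde{L} = 2L_\psi$
3:     Set $A_{k+1} = A_k + \alpha_{k+1}$, where $2\tilde{L}\alpha_{k+1}^2 = A_k + \alpha_{k+1}$
4:     $\tilde{y}^{k+1} = (A_ky^k + \alpha_{k+1}z^k)/A_{k+1}$
5:     $z^{k+1} = z^k - \alpha_{k+1}\tilde{\nabla}\Psi(\tilde{y}^{k+1}, \xi^{k+1})$
6:     $y^{k+1} = (A_ky^k + \alpha_{k+1}z^{k+1})/A_{k+1}$
7: end for
Ensure: $y^N$, $\tilde{x}^N = \frac{1}{A_N}\sum_{k=0}^{N}\alpha_k\tilde{x}(A^\top\tilde{y}^k, \xi^k)$.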
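To show how the similar-triangles updates fit together, the sketch below runs a noise-free version of the scheme ($\delta = 0$, $\sigma_\psi = 0$) on a toy instance. It is an illustration under stated assumptions, not the implementation from [51]: $f(x) = \frac{\mu}{2}\|x - c\|_2^2$ so that $x(A^\top y) = c + A^\top y/\mu$ is exact, the step-size equation is taken as $2\tilde{L}\alpha_{k+1}^2 = A_k + \alpha_{k+1}$, the loop performs exactly `N` updates, and the function name `spdstm_noise_free` is hypothetical.

```python
import math
import numpy as np


def spdstm_noise_free(A, c, mu, N):
    """Deterministic similar-triangles iteration on the dual of the toy problem."""
    L_psi = np.linalg.eigvalsh(A.T @ A).max() / mu
    L_tilde = 2.0 * L_psi
    y = np.zeros(A.shape[0])
    z = np.zeros(A.shape[0])
    A_k = 0.0
    x_sum = np.zeros(A.shape[1])
    for _ in range(N):
        # Positive root of 2 * L_tilde * alpha^2 = A_k + alpha.
        alpha = (1.0 + math.sqrt(1.0 + 8.0 * L_tilde * A_k)) / (4.0 * L_tilde)
        A_next = A_k + alpha
        y_tilde = (A_k * y + alpha * z) / A_next
        x_tilde = c + A.T @ y_tilde / mu  # exact oracle x(A^T y_tilde)
        z = z - alpha * (A @ x_tilde)     # grad psi(y_tilde) = A x(A^T y_tilde)
        y = (A_k * y + alpha * z) / A_next
        x_sum += alpha * x_tilde          # weighted primal average
        A_k = A_next
    return y, x_sum / A_k, A_k
```

The weights satisfy $A_N \ge N^2/(8\tilde{L})$, and in the noise-free case the standard similar-triangles estimate $\psi(y^N) - \psi(y^*) \le \|y^*\|_2^2/(2A_N)$ applies, which gives the accelerated $O(1/N^2)$ rate on the dual.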


Below we present the main convergence result of this section. 


Theorem 8 (Theorem 5.1 from [51]) Assume that $f$ is $\mu$-strongly convex and
$\|\nabla f(x^*)\|_2 = M_f$. Let $\varepsilon > 0$ be a desired accuracy. Next, assume that $f$ is
$L_f$-Lipschitz continuous on the ball $B_{R_f}(0)$ with


= R Amax(AlA)Ry 
Rp = (ms z ; max dR; Ry ; 
Anv Amax(AT A) li 


where $R_y$ is such that $\|y^*\|_2 \le R_y$, $y^*$ is the solution of the dual problem (22),
and $R_x = \|x(A^\top y^*)\|_2$. Assume that at iteration $k$ of Algorithm 20 batch size is


ee 
: oy, a In(N/B) a 
chosen according to the formula rg > max 1, ~——— }, where &, = ae 
HLR2 i : 
O<e< a 0<séd< wags and N > 1 for some numeric constant H > 0, 


G > OandC > 0. Then with probability > 1 — 4B, where B ¢€ (0, !/8), after 


~ My ‘ : max(A!A ~ 
N=O (y FEx(ATA)) iterations where x(Al A) = nay the outputs a 
and yN of Algorithm 20 satisfy the following condition: 


FE) -f09< fE)4+4O") se, AR ke < (44) 


€ 
Ry 
with probability at least $1 - 4\beta$. What is more, to guarantee (44) with probability at
least $1 - 4\beta$ Algorithm 20 requires


2M2 
O (mn (3 id x(AlA)In (5 /xa70) ‘ [xara] (45) 
calls of the biased stochastic oracle $\tilde{\nabla}\psi(y, \xi)$, i.e., $\tilde{x}(y, \xi)$.


3.2.2 Strongly Convex Dual Functions and Restarts Technique 


In this section, we assume that the primal functional $f$ is additionally $L$-smooth. It
implies that the dual function $\psi$ in (22) is additionally $\mu_\psi$-strongly convex in
$y^0 + (\mathrm{Ker}A^\top)^\perp$, where $\mu_\psi = \lambda_{\min}^{+}(A^\top A)/L$ [69, 129] and $\lambda_{\min}^{+}(A^\top A)$ is the minimal
positive eigenvalue of $A^\top A$.

From weak duality $-f(x^*) \le \psi(y^*)$ and (24) we get the key relation of this
section (see also [6, 7, 117])

$$f(x(A^\top y)) - f(x^*) \le \langle\nabla\psi(y), y\rangle = \langle Ax(A^\top y), y\rangle.\qquad(46)$$

This inequality implies the following theorem.


Theorem 9 (Theorem 5.2 from [51]) Consider function $f$ and its dual function
$\psi$ defined in (24) such that problems (21) and (22) have solutions. Assume that $\hat{y}$
is such that $\|\nabla\psi(\hat{y})\|_2 \le \varepsilon/R_y$ and $\|\hat{y}\|_2 \le 2R_y$, where $\varepsilon > 0$ is some positive
number and $R_y = \|y^*\|_2$, where $y^*$ is any minimizer of $\psi$. Then for $\hat{x} = x(A^\top\hat{y})$ the
following relations hold:

$$f(\hat{x}) - f(x^*) \le 2\varepsilon,\qquad \|A\hat{x}\|_2 \le \frac{\varepsilon}{R_y},\qquad(47)$$

where $x^*$ is any minimizer of $f$.
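Theorem 9 can be sanity-checked on a toy instance. The sketch below is a minimal illustration under stated assumptions: $f(x) = \frac{\mu}{2}\|x - c\|_2^2$, $A$ is the Laplacian of a 3-node path, $\hat{y}$ is a small perturbation of the minimum-norm dual solution, and $\varepsilon$ is defined as $R_y\|\nabla\psi(\hat{y})\|_2$ so that the gradient condition of the theorem holds by construction. The function name `theorem9_quantities` is hypothetical.

```python
import numpy as np


def theorem9_quantities(y_hat, A, c, mu, R_y):
    """Return (eps, f(x_hat) - f(x*), ||A x_hat||_2) for f(x) = mu/2 * ||x - c||^2."""
    x_hat = c + A.T @ y_hat / mu               # x(A^T y_hat)
    x_star = c - np.linalg.pinv(A) @ (A @ c)   # minimizer of f subject to A x = 0
    grad_norm = np.linalg.norm(A @ x_hat)      # ||grad psi(y_hat)||_2 by (35)
    eps = R_y * grad_norm                      # smallest eps with grad_norm <= eps / R_y
    f = lambda x: 0.5 * mu * np.sum((x - c) ** 2)
    return eps, f(x_hat) - f(x_star), grad_norm
```

The bound on the objective follows from (46) and the Cauchy-Schwarz inequality: $f(\hat{x}) - f(x^*) \le \|\nabla\psi(\hat{y})\|_2\|\hat{y}\|_2 \le \frac{\varepsilon}{R_y}\cdot 2R_y$.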


That is why, in this section we mainly focus on the methods that provide optimal
convergence rates for the gradient norm. In particular, we consider the Recursive
Regularization Meta-Algorithm (see Algorithm 21) from [43] with AC-SA$^2$ (see
Algorithm 23) as a subroutine (i.e., RRMA-AC-SA$^2$), which is based on the AC-SA
algorithm (see Algorithm 22) from [48]. We notice that RRMA-AC-SA$^2$ is applied
to a regularized dual function

$$\tilde{\psi}(y) = \psi(y) + \frac{\lambda}{2}\|y - y^0\|_2^2,\qquad(48)$$

where $\lambda > 0$ is some positive number which will be defined further. Function $\tilde{\psi}$ is
$\lambda$-strongly convex and $\tilde{L}_\psi$-smooth in $\mathbb{R}^n$, where $\tilde{L}_\psi = L_\psi + \lambda$. For now, we just
assume w.l.o.g. that $\tilde{\psi}$ is $(\mu_\psi + \lambda)$-strongly convex in $\mathbb{R}^n$, but we will go back to
this question further.

In this section we consider the same oracle as in Sect. 3.2.1, but we additionally
assume that $\delta = 0$, i.e., the stochastic first-order oracle is unbiased. To define the batched
version of the stochastic gradient we will use the following notation:

$$\nabla\Psi(y, \xi^t, r_t) = \frac{1}{r_t}\sum_{l=1}^{r_t}\nabla\psi(y, \xi^l),\qquad x(y, \xi^t, r_t) = \frac{1}{r_t}\sum_{l=1}^{r_t}x(y, \xi^l).\qquad(49)$$

As before, in the cases when the batch size $r_t$ can be restored from the context, we
will use the simplified notation $\nabla\Psi(y, \xi^t)$ and $x(y, \xi^t)$.


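Algorithm 21 RRMA-AC-SA$^2$ [43]

Require: $y^0$ (starting point), $m$ (total number of iterations)
1: $\psi_0 \leftarrow \tilde{\psi}$, $\hat{y}^0 \leftarrow y^0$, $T \leftarrow \left\lfloor\log_2\frac{L_\psi}{\lambda}\right\rfloor$
2: for $k = 1, \ldots, T$ do
3:     Run AC-SA$^2$ for $m/T$ iterations to optimize $\psi_{k-1}$ with $\hat{y}^{k-1}$ as a starting point and get the output $\hat{y}^k$
4:     $\psi_k(y) \leftarrow \tilde{\psi}(y) + \lambda\sum_{l=1}^{k}2^{l-1}\|y - \hat{y}^l\|_2^2$
5: end for
Ensure: $\hat{y}^T$.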


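In the AC-SA algorithm we use batched stochastic gradients of functions $\psi_k$
which are defined as follows:

$$\nabla\Psi_k(y, \xi^t) = \frac{1}{r_t}\sum_{l=1}^{r_t}\nabla\psi_k(y, \xi^l),\qquad(50)$$

$$\nabla\psi_k(y, \xi) = \nabla\psi(y, \xi) + \lambda(y - y^0) + \lambda\sum_{l=1}^{k}2^l(y - \hat{y}^l).$$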


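Algorithm 22 AC-SA [48]

Require: $z^0$ (starting point), $m$ (number of iterations), $\psi_k$ (objective function)
1: $y_{ag}^0 \leftarrow z^0$
2: for $t = 1, \ldots, m$ do
3:     $\alpha_t \leftarrow \frac{2}{t+1}$, $\gamma_t \leftarrow \frac{4L_\psi}{t(t+1)}$
4:     $y_{md}^t \leftarrow \frac{(1-\alpha_t)(\lambda+\gamma_t)}{\gamma_t+(1-\alpha_t^2)\lambda}y_{ag}^{t-1} + \frac{\alpha_t((1-\alpha_t)\lambda+\gamma_t)}{\gamma_t+(1-\alpha_t^2)\lambda}z^{t-1}$
5:     $z^t \leftarrow \frac{\alpha_t\lambda}{\lambda+\gamma_t}y_{md}^t + \frac{(1-\alpha_t)\lambda+\gamma_t}{\lambda+\gamma_t}z^{t-1} - \frac{\alpha_t}{\lambda+\gamma_t}\nabla\Psi_k(y_{md}^t, \xi^t)$
6:     $y_{ag}^t \leftarrow \alpha_tz^t + (1-\alpha_t)y_{ag}^{t-1}$
7: end for
Ensure: $y_{ag}^m$.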

The following theorem states the main result for RRMA-AC-SA$^2$ that we need in
this section.

Algorithm 23 AC-SA$^2$ [43]

Require: $z^0$ (starting point), $m$ (number of iterations), $\psi_k$ (objective function)
1: Run AC-SA for $m/2$ iterations to optimize $\psi_k$ with $z^0$ as a starting point and get the output $y^1$
2: Run AC-SA for $m/2$ iterations to optimize $\psi_k$ with $y^1$ as a starting point and get the output $y^2$
Ensure: $y^2$.


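Theorem 10 (Corollary 1 from [43]) Let $\psi$ be an $L_\psi$-smooth and $\mu_\psi$-strongly
convex function and $\lambda = \Theta\left((L_\psi\ln^2N)/N^2\right)$ for some $N > 1$. If Algorithm 21
performs $N$ iterations in total with batch size $r$ for all iterations, then it will provide
a point $\hat{y}$ such that

$$\mathbb{E}\left[\|\nabla\psi(\hat{y})\|_2^2 \mid y^0\right] \le C\left(\frac{L_\psi^2\|y^0 - y^*\|_2^2\ln^4N}{N^4} + \frac{\sigma_\psi^2\ln^6N}{rN}\right),\qquad(51)$$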


where $C > 0$ is some positive constant and $y^*$ is a solution of the dual problem (22).


The following result shows that w.l.o.g. we can assume that function $\psi$ defined
in (24) is $\mu_\psi$-strongly convex everywhere with $\mu_\psi = \lambda_{\min}^{+}(A^\top A)/L$. In fact, from
$L$-smoothness of $f$ we have only that $\psi$ is $\mu_\psi$-strongly convex in $y^0 + (\mathrm{Ker}(A^\top))^\perp$
(see [69, 129] for the details). However, the structure of the methods considered here
is such that all points generated by RRMA-AC-SA$^2$ and, in particular, AC-SA
lie in $y^0 + (\mathrm{Ker}(A^\top))^\perp$.

Theorem 11 (Theorem 5.4 from [51]) Assume that Algorithm 22 is run for the
objective $\psi_k(y) = \tilde{\psi}(y) + \lambda\sum_{l=1}^{k}2^{l-1}\|y - \hat{y}^l\|_2^2$ with $z^0$ as a starting point, where
$z^0, \hat{y}^1, \ldots, \hat{y}^k$ are some points from $y^0 + (\mathrm{Ker}(A^\top))^\perp$ and $y^0 \in \mathbb{R}^n$. Then for all
$t \ge 0$ we have $y_{md}^t, z^t, y_{ag}^t \in y^0 + (\mathrm{Ker}(A^\top))^\perp$.


Corollary 1 (Corollary 5.5 from [51]) Assume that Algorithm 21 is run for the
objective $\tilde{\psi}(y) = \psi(y) + \frac{\lambda}{2}\|y - y^0\|_2^2$ with $y^0$ as a starting point. Then
for all $k \ge 0$ we have $\hat{y}^k \in y^0 + (\mathrm{Ker}(A^\top))^\perp$.


Now we are ready to present our approach$^4$ of constructing an accelerated
method for the strongly convex dual problem using restarts of RRMA-AC-SA$^2$. To
explain the main idea we start with the simplest case: $\sigma_\psi^2 = 0$, $r = 0$. It means that
there is no stochasticity in the method and the bound (51) can be rewritten in the
following form:

$$\|\nabla\psi(\hat{y})\|_2 \le \frac{\sqrt{C}L_\psi\|y^0 - y^*\|_2\ln^2N}{N^2} \le \frac{\sqrt{C}L_\psi\|\nabla\psi(y^0)\|_2\ln^2N}{\mu_\psi N^2},\qquad(52)$$


3 The overall number of performed iterations during the calls of AC- SA? equals N. 
4 This approach was described in [32] and formally proved in [51]. 


286 E. Gorbunov et al. 


where we used inequality ||Vw(y°)|| > Lylly® — y*|| which follows from the 
[ty-Strong convexity of wy. It implies that after N = O(/Ly/uy) iterations of 
RRMA-AC-SA? the method returns such y! = § that || Vw(5!)ll2 < SIVY Ilo. 
Next, applying RRMA-AC-SA? with y! as a starting point for the same number 
of iterations we will get new point y? such that ||Vw(¥7)l2 < SIV )ll2 < 
FIVE). Then, after / = O (In(RyllV¥O)Il2/e)) of such restarts we can get the 
point ¥/ such that ||VW(7)|l2 < £/Ry with total number of gradients computations 
N1 = O (/Lv/uy In(RvllV¥Oll2/c)). 
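The restart argument in the deterministic case can be illustrated with a minimal sketch. It is not the authors' method: plain gradient descent stands in for RRMA-AC-SA$^2$ as the inner solver, and the quadratic objective, the function names, and the choice of seven inner steps are all assumptions made for the example. The only ingredient taken from the text is the outer loop: if each call halves the gradient norm, about $\log_2(R_y\|\nabla\psi(y^0)\|_2/\varepsilon)$ calls suffice.

```python
import math
import numpy as np

def gradient_descent(grad, y_start, step, n_iters):
    """Stand-in inner solver (plain gradient descent, NOT RRMA-AC-SA^2)."""
    y = y_start.copy()
    for _ in range(n_iters):
        y = y - step * grad(y)
    return y

def restarted_solver(grad, y0, step, inner_iters, eps, R_y):
    """Restart the inner solver; each restart is assumed to halve ||grad psi||_2."""
    g0 = np.linalg.norm(grad(y0))
    # Number of halvings needed to go from g0 down to eps / R_y.
    n_restarts = max(1, math.ceil(math.log2(R_y * g0 / eps)))
    y = y0
    for _ in range(n_restarts):
        y = gradient_descent(grad, y, step, inner_iters)
    return y, n_restarts

# Toy strongly convex quadratic psi(y) = 0.5 * y^T H y with mu = 1, L = 10.
H = np.diag([1.0, 10.0])
grad = lambda y: H @ y
y0 = np.array([1.0, 1.0])
# With step 1/L the gradient norm contracts by at least (1 - mu/L) = 0.9 per
# step, so 7 inner steps halve it (0.9**7 < 0.5).
y_final, n_restarts = restarted_solver(grad, y0, step=0.1, inner_iters=7,
                                       eps=1e-3, R_y=1.0)
```

An accelerated inner solver would need only about $\sqrt{L_\psi/\mu_\psi}$ steps per restart instead of about $L_\psi/\mu_\psi$; the outer loop stays the same.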

When $\sigma_\psi \ne 0$ we need to modify this approach. The first ingredient to
handle the stochasticity is a large enough batch size for the $l$-th restart: $r_l$ should be
$\Omega\left(\sigma_\psi^2/\|\nabla\psi(\bar y^{l-1})\|_2^2\right)$. However, in the stochastic case we do not have access to
$\nabla\psi(\bar y^{l-1})$, so such a batch size is impractical. One possible way to fix this issue,
which is the second ingredient of our approach, is to independently sample a large
enough number $\hat r_l \sim \sigma_\psi^2R_y^2/\varepsilon^2$ of additional stochastic gradients in order to get a good
enough approximation $\nabla\Psi(\bar y^{l-1}, \xi^{l-1}, \hat r_l)$ of $\nabla\psi(\bar y^{l-1})$, and to use the norm of this
approximation, which is close to the norm of the true gradient with high
probability, to estimate the needed batch size $r_l$ for the optimization procedure.
Using this, we can get a bound of the following form:

\[
\mathbb{E}\left[\|\nabla\psi(\bar y^{l})\|_2^2 \mid \bar y^{l-1}, r_l, \hat r_l\right] \le A_l \overset{\text{def}}{=} \frac{\|\nabla\psi(\bar y^{l-1})\|_2^2}{8} + \frac{\|\nabla\Psi(\bar y^{l-1}, \xi^{l-1}, \hat r_l) - \nabla\psi(\bar y^{l-1})\|_2^2}{32}.
\]


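The third ingredient is the amplification trick: we run $p_l = \Omega(\ln(1/\beta))$ independent
trajectories of RRMA-AC-SA$^2$, get points $\bar y^{l,1}, \ldots, \bar y^{l,p_l}$, and choose $\bar y^{l,p(l)}$
among them such that $\|\nabla\psi(\bar y^{l,p(l)})\|_2$ is close enough to $\min_{p=1,\ldots,p_l}\|\nabla\psi(\bar y^{l,p})\|_2$ with
high probability, i.e., $\|\nabla\psi(\bar y^{l,p(l)})\|_2^2 \le 2\min_{p=1,\ldots,p_l}\|\nabla\psi(\bar y^{l,p})\|_2^2 + \varepsilon^2/(8R_y^2)$ with
probability at least $1 - \beta$ for fixed $\nabla\Psi(\bar y^{l-1}, \xi^{l-1}, \hat r_l)$. We achieve this by
additionally sampling $\bar r_l \sim \sigma_\psi^2R_y^2/\varepsilon^2$ stochastic gradients at $\bar y^{l,p}$ for each trajectory
and choosing the index $p(l)$ corresponding to the smallest norm of the obtained batched
stochastic gradient. By Markov's inequality, for all $p = 1, \ldots, p_l$

\[
\mathbb{P}\left\{\|\nabla\psi(\bar y^{l,p})\|_2^2 \ge 2A_l \mid \bar y^{l-1}, r_l, \hat r_l\right\} \le \frac12,
\]

hence

\[
\mathbb{P}\left\{\min_{p=1,\ldots,p_l}\|\nabla\psi(\bar y^{l,p})\|_2^2 \ge 2A_l \mid \bar y^{l-1}, r_l, \hat r_l\right\} \le \frac{1}{2^{p_l}}.
\]

That is, for $p_l = \log_2(1/\beta)$ we have that with probability at least $1 - 2\beta$

\[
\|\nabla\psi(\bar y^{l,p(l)})\|_2^2 \le \frac{\|\nabla\psi(\bar y^{l-1})\|_2^2}{2} + \frac{\|\nabla\Psi(\bar y^{l-1}, \xi^{l-1}, \hat r_l) - \nabla\psi(\bar y^{l-1})\|_2^2}{8} + \frac{\varepsilon^2}{8R_y^2}
\]

for fixed $\nabla\Psi(\bar y^{l-1}, \xi^{l-1}, \hat r_l)$, which means that

\[
\|\nabla\psi(\bar y^{l,p(l)})\|_2^2 \le \frac{\|\nabla\psi(\bar y^{l-1})\|_2^2}{2} + \frac{\varepsilon^2}{4R_y^2}
\]

with probability at least $1 - 3\beta$. Therefore, after $l = \log_2\left(2R_y^2\|\nabla\psi(y^0)\|_2^2/\varepsilon^2\right)$ such
restarts our method provides the point $\bar y^{l,p(l)}$ such that with probability at least
$1 - 3l\beta$

\[
\|\nabla\psi(\bar y^{l,p(l)})\|_2^2 \le \frac{\|\nabla\psi(y^0)\|_2^2}{2^l} + \frac{\varepsilon^2}{4R_y^2}\sum_{k=0}^{l-1} 2^{-k} \le \frac{\varepsilon^2}{2R_y^2} + \frac{\varepsilon^2}{2R_y^2} = \frac{\varepsilon^2}{R_y^2}.
\]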

The approach informally described above is stated as Algorithm 24. 


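Algorithm 24 Restarted-RRMA-AC-SA$^2$

Require: $y^0$ (starting point), $l$ (number of restarts), $\{\hat r_k\}_{k=1}^{l}$, $\{\bar r_k\}_{k=1}^{l}$ (batch sizes), $\{p_k\}_{k=1}^{l}$ (amplification parameters)
1: Choose the smallest integer $N > 1$ such that $\frac{CL_\psi^2 \ln^4 N}{\mu_\psi^2 N^4} \le \frac{1}{32}$
2: $\bar y^{0,p(0)} = y^0$
3: for $k = 1, \ldots, l$ do
4:   Compute $\nabla\Psi(\bar y^{k-1,p(k-1)}, \xi^{k-1,p(k-1)}, \hat r_k)$
5:   $r_k = \max\left\{1, \frac{64C\sigma_\psi^2 \ln^6 N}{N\|\nabla\Psi(\bar y^{k-1,p(k-1)}, \xi^{k-1,p(k-1)}, \hat r_k)\|_2^2}\right\}$
6:   Run $p_k$ independent trajectories of RRMA-AC-SA$^2$ for $N$ iterations with batch size $r_k$, with $\bar y^{k-1,p(k-1)}$ as a starting point, and get outputs $\bar y^{k,1}, \ldots, \bar y^{k,p_k}$
7:   Compute $\nabla\Psi(\bar y^{k,1}, \xi^{k,1}, \bar r_k), \ldots, \nabla\Psi(\bar y^{k,p_k}, \xi^{k,p_k}, \bar r_k)$
8:   $p(k) = \operatorname{argmin}_{p=1,\ldots,p_k}\|\nabla\Psi(\bar y^{k,p}, \xi^{k,p}, \bar r_k)\|_2$
9: end for
Ensure: $\bar y^{l,p(l)}$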


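Theorem 12 (Theorem 5.6 from [51]) Assume that $\psi$ is $\mu_\psi$-strongly convex and
$L_\psi$-smooth. If Algorithm 24 is run with

\[
l = \max\left\{1, \log_2\frac{2R_y^2\|\nabla\psi(y^0)\|_2^2}{\varepsilon^2}\right\}, \qquad
\hat r_k = \max\left\{1, \frac{4\sigma_\psi^2\left(1 + \sqrt{3\ln\frac{l}{\beta}}\right)^2 R_y^2}{\varepsilon^2}\right\},
\]
\[
r_k = \max\left\{1, \frac{64C\sigma_\psi^2 \ln^6 N}{N\|\nabla\Psi(\bar y^{k-1,p(k-1)}, \xi^{k-1,p(k-1)}, \hat r_k)\|_2^2}\right\},
\]
\[
p_k = \max\left\{1, \log_2\frac{l}{\beta}\right\}, \qquad
\bar r_k = \max\left\{1, \frac{128\sigma_\psi^2\left(1 + \sqrt{3\ln\frac{lp_k}{\beta}}\right)^2 R_y^2}{\varepsilon^2}\right\} \tag{53}
\]

for all $k = 1, \ldots, l$, where $N > 1$ is such that $\frac{CL_\psi^2 \ln^4 N}{\mu_\psi^2 N^4} \le \frac{1}{32}$, $\beta \in (0, 1/3)$ and
$\varepsilon > 0$, then with probability at least $1 - 3\beta$

\[
\|\nabla\psi(\bar y^{l,p(l)})\|_2 \le \frac{\varepsilon}{R_y} \tag{54}
\]

and the total number of oracle calls equals

\[
\sum_{k=1}^{l}\left(\hat r_k + Np_kr_k + p_k\bar r_k\right) = \tilde O\left(\max\left\{\sqrt{\frac{L_\psi}{\mu_\psi}}, \frac{\sigma_\psi^2R_y^2}{\varepsilon^2}\right\}\right). \tag{55}
\]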


Corollary 2 (Corollary 5.7 from [51]) Under the assumptions of Theorem 12 we get
that with probability at least $1 - 3\beta$

\[
\|\bar y^{l,p(l)} - y^*\|_2 \le \frac{\varepsilon}{\mu_\psi R_y}, \tag{56}
\]

where $\beta \in (0, 1/3)$ and the total number of oracle calls is defined in (55).


Now we are ready to present convergence guarantees for the primal function and 
variables. 


Corollary 3 (Corollary 5.8 from [51]) Let the assumptions of Theorem 12 hold. 
Assume that f is L ¢-Lipschitz continuous on Br, (0) where 


Xr ATA R 
Ry - ( My a max ( a ; Ry 


8y Amax (A! A) M Ry 


and Ry, = ||x(A' y*)||2. Then, with probability at least 1 — 4B 


9 
Ax! < — 


Lr 
: é, — ? 
ae] 8Ry 


where B € (0, 1/4), © € (0, My R2) x! & x(AT SPO, &!-PO, 7) and to achieve it 
we need the following number of oracle calls: 


I 2142 

K — _ ~ L ~ o;M r 
) (rk + N perk + prrk) = O (mf ae A), a Kt }) » (58) 
k=1 


f(x!) — f@*) < (2 + (57) 
where $M = \|\nabla f(x^*)\|_2$.


3.2.3 Direct Acceleration for Strongly Convex Dual Function 
First of all, we consider the following minimization problem:

\[
\min_{y\in\mathbb{R}^n} \psi(y), \tag{59}
\]

where $\psi(y)$ is $\mu_\psi$-strongly convex and $L_\psi$-smooth. We use the same notation for
the objective in (59) as for the dual function from (22) because later in the
section we apply the algorithm introduced below to (22), but for now it is not
important that $\psi$ is a dual function for (21), and we prefer to consider a more general
situation. As in Sect. 3.2.1, we do not assume that we have access to the exact
gradient of $\psi(y)$ and consider instead a biased stochastic gradient $\nabla\psi(y, \xi)$
satisfying inequalities (39) and (40) with $\delta \ge 0$ and $\sigma_\psi \ge 0$. In the main method of
this section a batched version of the stochastic gradient is used:

\[
\nabla\Psi(y, \xi^k) = \frac{1}{r_k}\sum_{l=1}^{r_k}\nabla\psi(y, \xi^l), \tag{60}
\]

where $r_k$ is the batch size that we leave unspecified for now. Note that $\nabla\Psi(y, \xi^k)$
satisfies inequalities (42) and (43).
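The effect of batching in (60) can be checked numerically with a small sketch. The oracle below (exact gradient of $\frac12\|y\|_2^2$ plus unbiased Gaussian noise) and the names `batched_gradient` and `noisy_grad` are assumptions made for the illustration; the text itself allows a biased, sub-Gaussian oracle. For this unbiased toy oracle the mean squared error of the batched gradient shrinks roughly like $1/r$.

```python
import numpy as np

def batched_gradient(stoch_grad, y, batch_size, rng):
    """Average `batch_size` independent stochastic gradients at y, as in (60)."""
    samples = [stoch_grad(y, rng) for _ in range(batch_size)]
    return np.mean(samples, axis=0)

# Hypothetical oracle: exact gradient of psi(y) = 0.5 * ||y||^2 plus noise.
sigma = 1.0
def noisy_grad(y, rng):
    return y + sigma * rng.standard_normal(y.shape)

rng = np.random.default_rng(0)
y = np.zeros(5)  # the exact gradient is zero here, so the output is pure error
errors_r1 = [np.sum(batched_gradient(noisy_grad, y, 1, rng) ** 2)
             for _ in range(2000)]
errors_r16 = [np.sum(batched_gradient(noisy_grad, y, 16, rng) ** 2)
              for _ in range(2000)]
# Ratio of mean squared errors; close to the batch-size ratio 16 for this oracle.
ratio = np.mean(errors_r1) / np.mean(errors_r16)
```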

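We use the Stochastic Similar Triangles Method, which is stated in this section as
Algorithm 25, to solve problem (59). To define the iterate $z^{k+1}$ we use the following
sequence of functions:

\[
\tilde g_0(z) \overset{\text{def}}{=} \frac12\|z - z^0\|_2^2 + \alpha_0\left(\psi(y^0) + \langle\nabla\Psi(y^0, \xi^0), z - y^0\rangle + \frac{\mu_\psi}{2}\|z - y^0\|_2^2\right),
\]
\[
\tilde g_{k+1}(z) \overset{\text{def}}{=} \tilde g_k(z) + \alpha_{k+1}\left(\psi(\tilde y^{k+1}) + \langle\nabla\Psi(\tilde y^{k+1}, \xi^{k+1}), z - \tilde y^{k+1}\rangle + \frac{\mu_\psi}{2}\|z - \tilde y^{k+1}\|_2^2\right)
\]
\[
= \frac12\|z - z^0\|_2^2 + \sum_{l=0}^{k+1}\alpha_l\left(\psi(\tilde y^{l}) + \langle\nabla\Psi(\tilde y^{l}, \xi^{l}), z - \tilde y^{l}\rangle + \frac{\mu_\psi}{2}\|z - \tilde y^{l}\|_2^2\right). \tag{61}
\]

We notice that $\tilde g_k(z)$ is $(1 + A_k\mu_\psi)$-strongly convex.
For this algorithm we have the following convergence result.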


Theorem 13 (Theorem 5.11 from [51]) Assume that the function w is 1y,-strongly 
convex and Ly-smooth, 


2 N?o3 In & 
n= 0 (moe. ($2) v a 
wv E 
Algorithm 25 Stochastic similar triangles method for strongly convex problems
(SSTM_sc)
Require: $\tilde y^0 = z^0 = y^0$ (starting point), $N$ (number of iterations)
1: Set $\alpha_0 = A_0 = 1/L_\psi$
2: Get $\nabla\Psi(y^0, \xi^0)$ to define $\tilde g_0(z)$
3: for $k = 0, 1, \ldots, N - 1$ do
4:   Choose $\alpha_{k+1}$ such that $A_{k+1} = A_k + \alpha_{k+1}$, $A_{k+1}(1 + A_k\mu_\psi) = \alpha_{k+1}^2L_\psi$
5:   $\tilde y^{k+1} = (A_ky^k + \alpha_{k+1}z^k)/A_{k+1}$
6:   $z^{k+1} = \operatorname{argmin}_{z\in\mathbb{R}^n}\tilde g_{k+1}(z)$, where $\tilde g_{k+1}(z)$ is defined in (61)
7:   $y^{k+1} = (A_ky^k + \alpha_{k+1}z^{k+1})/A_{k+1}$
8: end for
Ensure: $y^N$


3 2, 
; /2 N2o i(tr/3in§ in a 7 
i.e, re => amaxy 1, a with positive constants C > 0, 


€ > Oand N > 1. If additionally 6 < Nes and é€ < ax where Ro = ||y* — y°|l2 
N 
and Algorithm 25 is run for N iterations, then with probability at least 1 — 3B 


J? R2 
lly —y*Ip s —, (62) 
AN 
where B € (0, 1/3), 
in(4) +inin(#) 2,2 p2 
B b 20-a-R 
(N) i, “peta a 
' - a r| Ly Ly 
+ n(4) 
7 3/2 N 7 3/2 
# =suc(“) ori(v(;) +1)( +2DhG? 4 2¢ (© ) ( 2Du’) 
Ly Ly 
2 2 
h=u=—, c=, 
My My 
je ty 26, 20? +()" 2/2CH +( wy" 4CH 
Hy LybyNVAn wi,N? \ey) Ly eyNVAN Ly wi, NPAN 


~S 
ll 


a ~~) A Ly, \?/2 
7 38D +,/9B?D2 +44 + 8cHc (=) 
max ‘ 
Ly 


2 
: Ly \?
By =hG +uC,,/2HC (=) @(N). 
by 


and C is some positive constant. In other words, to achieve \|yN — y*|I5 <€ 


with probability at least 1 — 3B Algorithm 25 needs N = O (/= ) iterations 


and O (max ( / a, “h) oracle calls where O(-) hides polylogarithmic factors 


depending on Ly, ty, Ro, €, and B. 


Next, we apply SSTM_sc to problem (22) when the objective of the
primal problem (21) is $L$-smooth, $\mu$-strongly convex, and $L_f$-Lipschitz continuous
on some ball which will be specified next, i.e., we consider the same setup as in
Sect. 3.2.1 but additionally assume that the primal functional $f$ has an $L$-Lipschitz
continuous gradient. As in Sect. 3.2.1, we also consider the case when the gradient
of the dual functional is known only through biased stochastic estimators, see
(36)-(43) and the paragraphs containing these formulas.

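In Sects. 3.2.1 and 3.2.2 we mentioned that in the considered case the dual function
$\psi$ is $L_\psi$-smooth on $\mathbb{R}^n$ and $\mu_\psi$-strongly convex on $y^0 + (\mathrm{Ker}A^\top)^\perp$, where $L_\psi =
\lambda_{\max}(A^\top A)/\mu$ and $\mu_\psi = \lambda^+_{\min}(A^\top A)/L$. Using the same technique as in the proof of
Theorem 11, we show next that w.l.o.g. one can assume that $\psi$ is $\mu_\psi$-strongly convex
on $\mathbb{R}^n$ since $\nabla\Psi(y, \xi^k)$ lies in $\mathrm{Im}A = (\mathrm{Ker}A^\top)^\perp$ by definition of $\nabla\Psi(y, \xi^k)$. For
this purpose we need the explicit formula for $z^{k+1}$, which follows from the equation
$\nabla\tilde g_{k+1}(z^{k+1}) = 0$:

\[
z^{k+1} = \frac{z^0}{1 + A_{k+1}\mu_\psi} + \sum_{l=0}^{k+1}\frac{\alpha_l\mu_\psi}{1 + A_{k+1}\mu_\psi}\tilde y^l - \frac{1}{1 + A_{k+1}\mu_\psi}\sum_{l=0}^{k+1}\alpha_l\nabla\Psi(\tilde y^l, \xi^l). \tag{63}
\]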
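The closed-form update (63) makes Algorithm 25 straightforward to prototype. The sketch below is a simplified, assumption-laden illustration: it uses an exact gradient oracle instead of batched stochastic gradients, a toy diagonal quadratic as the objective, and the function name `sstm_sc` is chosen only for this example. The step size $\alpha_{k+1}$ is the positive root of the quadratic equation implied by line 4 of the algorithm.

```python
import numpy as np

def sstm_sc(grad, y0, L, mu, n_iters):
    """Similar-triangles iteration of Algorithm 25 with an exact gradient."""
    A = alpha = 1.0 / L                      # alpha_0 = A_0 = 1 / L
    y, z0, z = y0.copy(), y0.copy(), y0.copy()
    # Running sum over l of alpha_l * (mu * y_tilde^l - grad(y_tilde^l)), cf. (63).
    acc = alpha * (mu * y0 - grad(y0))
    for _ in range(n_iters):
        # Positive root of L a^2 - c a - A c = 0, which is equivalent to
        # A_{k+1} (1 + A_k mu) = a^2 L with A_{k+1} = A_k + a.
        c = 1.0 + A * mu
        alpha = (c + np.sqrt(c * c + 4.0 * L * A * c)) / (2.0 * L)
        A_next = A + alpha
        y_tilde = (A * y + alpha * z) / A_next
        acc = acc + alpha * (mu * y_tilde - grad(y_tilde))
        z = (z0 + acc) / (1.0 + A_next * mu)  # closed-form argmin, see (63)
        y = (A * y + alpha * z) / A_next
        A = A_next
    return y

# Toy problem: psi(y) = 0.5 * y^T H y with mu = 1, L = 100, minimizer y* = 0.
H = np.diag([1.0, 100.0])
y_out = sstm_sc(lambda y: H @ y, np.array([1.0, 1.0]), L=100.0, mu=1.0,
                n_iters=300)
```

Keeping the running sum `acc` avoids storing all previous points $\tilde y^l$, which is one practical reason to work with (63) rather than with the definition (61) directly.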


Theorem 14 (Theorem 5.12 from [51]) For all $k \ge 0$ we have that the iterates of
Algorithm 25 $\tilde y^k$, $z^k$, $y^k$ lie in $y^0 + (\mathrm{Ker}(A^\top))^\perp$.


This theorem makes it possible to apply the result from Theorem 13 for 
SSTM_sc which is run on the problem (22). 


Corollary 4 (Corollary 5.13 from [51]) Under assumptions of Theorem 13 we 


get that after N = O (, / = In 1) iterations of Algorithm 25 which is run on the 
problem (22) with probability at least 1 — 3B 


IVwO lo < (64) 


€ 
? 
Ry 


where B € (0, !/3) and the total number of oracles calls equals 
bs / 2 Re
O (a a _ . \ : (65) 


If additionally € < Ly RS then with probability at least 1 — 3B 


ly” —y*llo < (66) 


Hy Ry 
lly lz < 2Ry. (67) 


Corollary 5 (Corollary 5.14 from [51]) Let the assumptions of Theorem 13 hold. 
Assume that f is L ¢-Lipschitz continuous on Br f (O) where 


a 2C oie 4 VAmax (ATA) E 
PNY dmax(ATA) "7 i Ry 


= ||x(A'y*)ll2, ¢ < [ey RS and by < We for some positive constant G}. 


Assume additionally that the last batch size ry is slightly bigger than other batch 
sizeS, 1.€., 


222 N 2 2 2 N : 2 
za secs ED), RP of (1+,/3in 4) RB? 


1 
> — max 71, 
i Cc (e 2 


Then, with probability at least 1 — 4B 


~N * 2C 
f& = sory < (2+ (J; Inm(ATA) at oi) Z )s (69) 
E 
Ry’ 


WARN Ia < (14+ V2C + GiVimax(ATA)) 


(70) 


where B € (0, 1/4), 2 KA yy, eN ,’n) and to achieve it we need the total 


number of oracle calls including the cost of computing XN equals 


0 (max eis ee enue ee a"), (71) 


where M = ||V f (x*)|l2. 
3.3 Applications to Decentralized Distributed Optimization


In this section, we apply our results to decentralized optimization problems. First
of all, we want to add additional motivation for the problem we are focusing on. As
stated in the introductory part of this work, we are interested in the convex
optimization problem

\[
\min_{x\in Q\subseteq\mathbb{R}^n} f(x), \tag{72}
\]

where $f$ is a convex function and $Q$ is a closed and convex subset of $\mathbb{R}^n$. More
precisely, we study the particular case of (72) when the objective function $f$ can be
represented as a mathematical expectation

\[
f(x) = \mathbb{E}_\xi\left[f(x, \xi)\right], \tag{73}
\]

where $\xi$ is a random variable. Typically $x$ represents the feature vector defining
the model, only samples of $\xi$ are available and the distribution of $\xi$ is unknown.
One possible way to minimize the generalization error (73) is to solve the empirical risk
minimization or finite-sum minimization problem instead, i.e., solve (72) with the
objective

\[
\hat f(x) = \frac{1}{m}\sum_{i=1}^{m} f(x, \xi_i), \tag{74}
\]

where $m$ should be sufficiently large to approximate the initial problem. Indeed, if
$f(x, \xi)$ is convex and $M$-Lipschitz continuous for all $\xi$, $Q$ has finite diameter $D$
and $\hat x = \operatorname{argmin}_{x\in Q}\hat f(x)$, then (see [23, 140]) with probability at least $1 - \beta$


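\[
f(\hat x) - \min_{x\in Q} f(x) = O\left(\sqrt{\frac{M^2D^2n\ln(m)\ln(n/\beta)}{m}}\right) \tag{75}
\]

and if additionally $f(x, \xi)$ is $\mu$-strongly convex for all $\xi$, then (see [42]) with
probability at least $1 - \beta$

\[
f(\hat x) - \min_{x\in Q} f(x) = O\left(\frac{M^2D^2\ln(m)\ln(m/\beta)}{\mu m} + \sqrt{\frac{M^2D^2\ln(1/\beta)}{m}}\right). \tag{76}
\]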


In other words, to solve (72)+(73) with $\varepsilon$ functional accuracy via minimization of
the empirical risk (74) one needs $m = \tilde\Omega\left(M^2D^2n/\varepsilon^2\right)$ in the convex case and
$m = \tilde\Omega\left(\max\left\{M^2D^2/(\mu\varepsilon), M^2D^2/\varepsilon^2\right\}\right)$ in the $\mu$-strongly convex case, where $\tilde\Omega(\cdot)$ hides
a constant factor, a logarithmic factor of $1/\beta$ and a polylogarithmic factor of $1/\varepsilon$.
Stochastic first-order methods such as Stochastic Gradient Descent (SGD) [57,
115, 120, 128, 161] or its accelerated variants like AC-SA [84] or the Similar Triangles
Method (STM) [39, 47, 118] are a very popular choice for solving either (72)+(73) or
(72)+(74). Although their iterations are cheap in terms of computational cost,
these methods converge only to a neighborhood of the solution, i.e., to a ball
centered at the optimum with radius proportional to the standard deviation of the
stochastic estimator. For the particular case of the finite-sum minimization problem one
can solve this issue via the variance-reduction trick [26, 54, 64, 138] and its accelerated
variants [5, 174, 175]. Unfortunately, this technique is not applicable in general
to problems of type (72)+(73). Another possible way to reduce the variance
is mini-batching. When the objective function is $L$-smooth, one can accelerate the
computation of batches using parallelization [27, 38, 47, 49], and it is one of the
examples where centralized distributed optimization appears naturally [13].

In other words, in some situations, e.g., when the number of samples $m$ is too
big, it is preferable in practice to split the data into $q$ blocks, assign each block to
a separate worker, e.g., a processor, and organize the computation of the gradient or
stochastic gradient in a parallel or distributed manner. Moreover, in view of
(75)-(76), solving an expectation minimization problem sometimes requires
such a big number of samples that the corresponding information (e.g., some objects
like images, videos, etc.) cannot be stored on one machine because of memory
limitations (see Sect. 3.5 for a detailed example of such a situation). Then, we can
rewrite the objective function in the following form:

\[
f(x) = \frac{1}{q}\sum_{i=1}^{q} f_i(x), \qquad f_i(x) = \mathbb{E}_{\xi_i}\left[f(x, \xi_i)\right] \quad\text{or}\quad f_i(x) = \frac{1}{s_i}\sum_{j=1}^{s_i} f(x, \xi_{ij}). \tag{77}
\]

Here $f_i$ corresponds to the loss on the $i$-th data block and can also be represented
as an expectation or a finite sum. So, the general idea of parallel optimization is to
compute gradients or stochastic gradients on each worker, then aggregate the results
on the master node and broadcast the new iterate, or the information needed to obtain
the new iterate, back to the workers.
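The master/worker scheme just described can be sketched in a few lines. Everything specific here is an assumption for illustration: the least-squares loss on each block, the equal block sizes, and the function names `worker_gradient` and `master_step`; the workers are simulated by a plain loop rather than by real parallel processes.

```python
import numpy as np

def worker_gradient(x, A_block, b_block):
    """Gradient of f_i(x) = (1 / (2 s_i)) * ||A_i x - b_i||^2 on one data block."""
    return A_block.T @ (A_block @ x - b_block) / len(b_block)

def master_step(x, blocks, step):
    """Aggregate the workers' gradients and produce the next iterate."""
    grads = [worker_gradient(x, A_i, b_i) for A_i, b_i in blocks]  # parallel in practice
    return x - step * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
A, b = rng.standard_normal((12, 3)), rng.standard_normal(12)
blocks = [(A[i:i + 4], b[i:i + 4]) for i in range(0, 12, 4)]  # q = 3 equal blocks
x = np.zeros(3)

# With equal block sizes, averaging the block gradients gives the full gradient.
aggregated = np.mean([worker_gradient(x, A_i, b_i) for A_i, b_i in blocks], axis=0)
full = A.T @ (A @ x - b) / len(b)
x_next = master_step(x, blocks, step=0.1)
```

Every step requires all workers to report before the master can proceed, which is the synchronization bottleneck that motivates the decentralized approach.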

The apparent simplicity of the parallel scheme hides a synchronization drawback and
high requirements on the master node [135]. A large body of work aims to solve
this issue via periodic synchronization [53, 72, 75, 148, 165, 166, 172], error
compensation [16, 55, 71, 149], quantization [4, 61, 62, 109, 164], or a combination
of these techniques [11, 106].

However, in this work we mainly focus on another approach to dealing with the
aforementioned drawbacks: decentralized distributed optimization [13, 73]. It is
based on two basic principles: every node communicates only with its neighbors, and
communications are performed simultaneously. Moreover, this architecture is more
robust, e.g., it can be applied to time-varying (wireless) communication networks
[134].
But let us first consider the centralized or parallel architecture. As we mentioned
in the introduction, when the objective function is $L$-smooth one can compute
batches in parallel [27, 38, 47, 49] in order to accelerate the method
and obtain a method (see Section 3 from [51] for the details) using

\[
O\left(\frac{\sigma^2R^2/\varepsilon^2}{\sqrt{LR^2/\varepsilon}}\right) \quad\text{or}\quad O\left(\frac{\sigma^2/(\mu\varepsilon)}{\sqrt{L/\mu}\,\ln\left(\mu R^2/\varepsilon\right)}\right) \tag{78}
\]

workers and having a working time proportional to the number of iterations of
an accelerated first-order method. However, the number of workers defined in (78)
could be too big to use such an approach in practice. Still, computing
the batches in parallel even with a much smaller number of workers could reduce the
working time of the method if the communication is fast enough.
Besides the computation of batches in parallel for the general type of problem
(72)+(73), parallel optimization is often applied to the finite-sum minimization
problems (72)+(74) or (72)+(77) that we rewrite here in the following form:

\[
\min_{x\in Q\subseteq\mathbb{R}^n} f(x) = \frac{1}{m}\sum_{k=1}^{m} f_k(x). \tag{79}
\]

We notice that in this section $m$ is the number of workers and $f_k(x)$ is known only
to the $k$-th worker. Consider the situation when workers are connected in a network
and one can construct a spanning tree for this network. Assume that the diameter
of the obtained graph equals $d$, i.e., the height of the tree, which is the maximal
distance (in terms of connections) between the root and a leaf [135]. If we run the
Similar Triangles Method (STM, [47]) on such a spanning tree, then the number of
communication rounds will be


2 p2 2 2 2 
0 (a amin | n() 2 n(= )m()]). 
E B be é B 


where 
LR2 |L LR? 
N=O|min : In 
€ iu € 


Now let us consider the decentralized case when workers can communicate only
with their neighbors. Next, we describe how to reflect this restriction
in problem (79). Consider the Laplacian matrix $\bar W \in \mathbb{R}^{m\times m}$ of the network with
vertices $V$ and edges $E$, which is defined as follows:

\[
\bar W_{ij} = \begin{cases} -1, & \text{if } (i, j) \in E, \\ \deg(i), & \text{if } i = j, \\ 0, & \text{otherwise}, \end{cases} \tag{80}
\]

where $\deg(i)$ is the degree of the $i$-th node, i.e., the number of neighbors of the $i$-th worker.
Since we consider only connected networks, the matrix $\bar W$ has a unique (up to a
scalar factor) eigenvector $\mathbf{1}_m \overset{\text{def}}{=} (1, \ldots, 1)^\top \in \mathbb{R}^m$ corresponding to the eigenvalue 0. It
implies that for all vectors $a = (a_1, \ldots, a_m)^\top \in \mathbb{R}^m$ the following equivalence holds:

\[
a_1 = \ldots = a_m \iff \bar Wa = 0. \tag{81}
\]

Now let us think about $a_i$ as a number that the $i$-th node stores. Then, using (81) we
can use the Laplacian matrix to express in short matrix form the fact that all nodes
of the network store the same number. In order to generalize it to the case when
$a_i$ are vectors from $\mathbb{R}^n$ we should consider the matrix $W \overset{\text{def}}{=} \bar W \otimes I_n$, where $\otimes$
represents the Kronecker product. Indeed, if we consider vectors $x_1, \ldots, x_m \in \mathbb{R}^n$
and $\mathbf{x} = (x_1^\top, \ldots, x_m^\top)^\top \in \mathbb{R}^{nm}$, then (81) implies

\[
x_1 = \ldots = x_m \iff W\mathbf{x} = 0. \tag{82}
\]

For simplicity, we also call $W$ a Laplacian matrix; this does not lead to
misunderstanding since everywhere below we use $W$ instead of $\bar W$. The key
observation here is that the computation of $W\mathbf{x}$ requires one round of communication
in which the $k$-th worker sends $x_k$ to all its neighbors and receives $x_j$ for all $j$ such
that $(k, j) \in E$, i.e., the $k$-th worker gets vectors from all its neighbors. Note that $W$
is symmetric and positive semi-definite [135] and, as a consequence, $\sqrt{W}$ exists.
Moreover, we can replace $W$ by $\sqrt{W}$ in (82) and get the equivalent statement:

\[
x_1 = \ldots = x_m \iff \sqrt{W}\mathbf{x} = 0. \tag{83}
\]
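The consensus characterizations (81)-(83) are easy to verify numerically on a small graph. The sketch below assumes a path graph on four nodes with two-dimensional local vectors; the helper names `laplacian` and `matrix_sqrt_psd` are chosen for this illustration. The square root is computed through an eigendecomposition, which is one standard way to obtain the symmetric root of a positive semi-definite matrix.

```python
import numpy as np

def laplacian(num_nodes, edges):
    """Graph Laplacian from (80): -1 for edges, deg(i) on the diagonal."""
    W_bar = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        W_bar[i, j] = W_bar[j, i] = -1.0
        W_bar[i, i] += 1.0
        W_bar[j, j] += 1.0
    return W_bar

def matrix_sqrt_psd(W):
    """Symmetric square root of a positive semi-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(W)
    eigvals = np.clip(eigvals, 0.0, None)   # remove tiny negative round-off
    return eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T

m, n = 4, 2
W_bar = laplacian(m, [(0, 1), (1, 2), (2, 3)])   # path graph on 4 nodes
W = np.kron(W_bar, np.eye(n))                    # W = W_bar (Kronecker) I_n
sqrt_W = matrix_sqrt_psd(W)

consensus = np.tile([1.0, -2.0], m)              # x_1 = ... = x_m = (1, -2)
non_consensus = np.arange(m * n, dtype=float)    # blocks differ from each other
```

For a connected graph the null space of $W$ has dimension exactly $n$: it consists of the stacked vectors with identical blocks.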


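Using this we can rewrite problem (79) in the following way:

\[
\min_{\substack{\sqrt{W}\mathbf{x} = 0, \\ x_1, \ldots, x_m \in Q \subseteq \mathbb{R}^n}} f(\mathbf{x}) = \frac{1}{m}\sum_{k=1}^{m} f_k(x_k). \tag{84}
\]

We are interested in the general case when $f_k(x_k) = \mathbb{E}_{\xi_k}\left[f_k(x_k, \xi_k)\right]$, where $\{\xi_k\}_{k=1}^{m}$
are independent. This type of objective can be considered as a special case of (77).
Then, as mentioned in the introduction, it is natural to use stochastic gradients
$\nabla f_k(x_k, \xi_k)$ that satisfy

\[
\left\|\mathbb{E}_{\xi_k}\left[\nabla f_k(x_k, \xi_k)\right] - \nabla f_k(x_k)\right\|_2 \le \delta, \tag{85}
\]
\[
\mathbb{E}_{\xi_k}\left[\exp\left(\frac{\left\|\nabla f_k(x_k, \xi_k) - \mathbb{E}_{\xi_k}\left[\nabla f_k(x_k, \xi_k)\right]\right\|_2^2}{\sigma^2}\right)\right] \le \exp(1). \tag{86}
\]

Then, the stochastic gradient

\[
\nabla f(\mathbf{x}, \xi) \overset{\text{def}}{=} \nabla f(\mathbf{x}, \{\xi_k\}_{k=1}^{m}) \overset{\text{def}}{=} \frac{1}{m}\sum_{k=1}^{m}\nabla f_k(x_k, \xi_k)
\]

satisfies (see also (43))

\[
\mathbb{E}_{\xi}\left[\exp\left(\frac{\left\|\nabla f(\mathbf{x}, \xi) - \mathbb{E}_{\xi}\left[\nabla f(\mathbf{x}, \xi)\right]\right\|_2^2}{\sigma_f^2}\right)\right] \le \exp(1)
\]

with $\sigma_f^2 = O\left(\sigma^2/m\right)$.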

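As always, we start with the smooth case with $Q = \mathbb{R}^n$ and assume that each
$f_k$ is $L$-smooth, $\mu$-strongly convex and satisfies $\|\nabla_k f_k(x_k)\|_2 \le M$ on some ball
$B_{R_M}(x^*)$, where we use $\nabla_k f_k(x_k)$ to emphasize that $f_k$ depends only on the $k$-th
$n$-dimensional block of $\mathbf{x}$. Since the functional $f(\mathbf{x})$ in (84) has separable structure, it
implies that $f$ is $L/m$-smooth, $\mu/m$-strongly convex and satisfies $\|\nabla f(\mathbf{x})\|_2 \le M/\sqrt{m}$
on $B_{\sqrt{m}R_M}(\mathbf{x}^*)$. Indeed, for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^{nm}$

\[
\|\mathbf{x} - \mathbf{y}\|_2^2 = \sum_{k=1}^{m}\|x_k - y_k\|_2^2,
\]
\[
\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2 = \sqrt{\frac{1}{m^2}\sum_{k=1}^{m}\|\nabla_k f_k(x_k) - \nabla_k f_k(y_k)\|_2^2} \le \sqrt{\frac{L^2}{m^2}\sum_{k=1}^{m}\|x_k - y_k\|_2^2} = \frac{L}{m}\|\mathbf{x} - \mathbf{y}\|_2,
\]
\[
f(\mathbf{x}) = \frac{1}{m}\sum_{k=1}^{m} f_k(x_k) \ge \frac{1}{m}\sum_{k=1}^{m}\left(f_k(y_k) + \langle\nabla_k f_k(y_k), x_k - y_k\rangle + \frac{\mu}{2}\|x_k - y_k\|_2^2\right)
= f(\mathbf{y}) + \langle\nabla f(\mathbf{y}), \mathbf{x} - \mathbf{y}\rangle + \frac{\mu}{2m}\|\mathbf{x} - \mathbf{y}\|_2^2,
\]
\[
\|\nabla f(\mathbf{x})\|_2^2 = \frac{1}{m^2}\sum_{k=1}^{m}\|\nabla_k f_k(x_k)\|_2^2.
\]

Therefore, one can consider problem (84) as (21) with $A = \sqrt{W}$ and $Q = \mathbb{R}^{nm}$.
Next, if the starting point $\mathbf{x}^0$ is such that $\mathbf{x}^0 = \left((x^0)^\top, \ldots, (x^0)^\top\right)^\top$, then

\[
\mathbf{R}^2 \overset{\text{def}}{=} \|\mathbf{x}^0 - \mathbf{x}^*\|_2^2 = m\|x^0 - x^*\|_2^2 = mR^2, \qquad
R_y^2 \overset{\text{def}}{=} \|y^*\|_2^2 \le \frac{\|\nabla f(\mathbf{x}^*)\|_2^2}{\lambda^+_{\min}(W)} \le \frac{M^2}{m\lambda^+_{\min}(W)}.
\]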


Now it should become clear why in Sect. 3.1 we paid most of our attention to the
number of $A^\top A\mathbf{x}$ calculations. In this particular scenario $A^\top A\mathbf{x} = \sqrt{W}^\top\sqrt{W}\mathbf{x} =
W\mathbf{x}$, which can be computed via one round of communication of each node with its
neighbors, as mentioned earlier in this section. That is, for the primal approach
we can simply use the results discussed in Sect. 3.1. For convenience, we summarize
them in Tables 3 and 4, which are obtained by plugging the parameters that we
obtained above into the bounds from Sect. 3.1. Note that the results presented in these
tables match the lower bounds obtained in [8] in terms of the number of communication
rounds up to logarithmic factors, and there is a conjecture [32] that these bounds are
also optimal in terms of the number of oracle calls per node for the class of methods that
require the optimal number of communication rounds. Recently, a very similar result
about the optimal balance between the number of oracle calls per node and the number of
communication rounds was proved for the case when the primal functional is convex
and $L$-smooth and a deterministic first-order oracle is available [168].
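The observation that $A^\top A\mathbf{x} = W\mathbf{x}$ costs one round of neighbor-to-neighbor communication can be sketched as follows. The dictionary-based network description, the three-node path graph, and the name `communication_round` are assumptions for this example; a real implementation would exchange messages between processes rather than read a shared dictionary.

```python
import numpy as np

def communication_round(local_vectors, neighbors):
    """Node k computes [W x]_k = deg(k) * x_k - (sum of its neighbors' vectors)."""
    result = {}
    for k, x_k in local_vectors.items():
        received = [local_vectors[j] for j in neighbors[k]]  # messages from neighbors only
        result[k] = len(received) * x_k - np.sum(received, axis=0)
    return result

# Path graph 0 - 1 - 2 with n = 2 dimensional local vectors.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
local_vectors = {0: np.array([1.0, 0.0]),
                 1: np.array([0.0, 2.0]),
                 2: np.array([3.0, 1.0])}
Wx_blocks = communication_round(local_vectors, neighbors)

# Reference computation with the explicit matrix W = W_bar (Kronecker) I_n.
W_bar = np.array([[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])
W = np.kron(W_bar, np.eye(2))
x_stacked = np.concatenate([local_vectors[k] for k in range(3)])
Wx_reference = W @ x_stacked
```

No node ever needs the full matrix $W$ or the vectors of non-neighbors, which is what makes the number of $A^\top A\mathbf{x}$ products the natural measure of communication cost.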

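Finally, consider the situation when $Q = \mathbb{R}^n$ and each $f_k$ from (84) is
dual-friendly, i.e., one can construct the dual problem for (84)

\[
\min_{\mathbf{y}\in\mathbb{R}^{nm}} \Psi(\mathbf{y}), \quad\text{where } \mathbf{y} = (y_1^\top, \ldots, y_m^\top)^\top \in \mathbb{R}^{nm}, \; y_1, \ldots, y_m \in \mathbb{R}^n, \tag{87}
\]
\[
\varphi_k(y_k) = \max_{x_k\in\mathbb{R}^n}\left\{\langle y_k, x_k\rangle - f_k(x_k)\right\}, \tag{88}
\]
\[
\Phi(\mathbf{y}) = \frac{1}{m}\sum_{k=1}^{m}\varphi_k(my_k), \qquad \Psi(\mathbf{y}) = \Phi(\sqrt{W}\mathbf{y}) = \frac{1}{m}\sum_{k=1}^{m}\varphi_k\left(m[\sqrt{W}\mathbf{y}]_k\right), \tag{89}
\]

where $[\sqrt{W}\mathbf{y}]_k$ is the $k$-th $n$-dimensional block of $\sqrt{W}\mathbf{y}$. Note that

\[
\max_{\mathbf{x}\in\mathbb{R}^{nm}}\left\{\langle\mathbf{y}, \mathbf{x}\rangle - f(\mathbf{x})\right\} = \max_{\mathbf{x}\in\mathbb{R}^{nm}}\left\{\sum_{k=1}^{m}\langle y_k, x_k\rangle - \frac{1}{m}\sum_{k=1}^{m} f_k(x_k)\right\}
\]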

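Table 3 Summary of the covered results in this paper for solving (84) using the primal deterministic
approach from Sect. 3.1. First column contains assumptions on $f_k$, $k = 1, \ldots, m$ in addition to the
convexity, $\chi = \chi(W) = \lambda_{\max}(W)/\lambda^+_{\min}(W)$, where $\lambda_{\max}(W)$ and $\lambda^+_{\min}(W)$ are maximal and minimal
positive eigenvalues of matrix $W$. All methods except D-MASG should be applied to solve (25)

Assumptions on $f_k$ | Method | # of communication rounds | # of $\nabla f_k(x)$ oracle calls per node
$\mu$-strongly convex, $L$-smooth | D-MASG, $Q = \mathbb{R}^n$, [41] | $\tilde O\left(\sqrt{\frac{L}{\mu}\chi}\right)$ | $\tilde O\left(\sqrt{\frac{L}{\mu}}\right)$
$L$-smooth | STP_IPS with STP as a subroutine, $Q = \mathbb{R}^n$, [51] | $\tilde O\left(\sqrt{\frac{LR^2}{\varepsilon}\chi}\right)$ | $\tilde O\left(\sqrt{\frac{LR^2}{\varepsilon}}\right)$
$\mu$-strongly convex, $\|\nabla f_k(x)\|_2 \le M$ | R-Sliding, [32, 85, 86, 88] | $\tilde O\left(\sqrt{\frac{M^2}{\mu\varepsilon}\chi}\right)$ | $\tilde O\left(\frac{M^2}{\mu\varepsilon}\right)$
$\|\nabla f_k(x)\|_2 \le M$ | Sliding, [85, 86, 88] | $O\left(\sqrt{\frac{M^2R^2}{\varepsilon^2}\chi}\right)$ | $O\left(\frac{M^2R^2}{\varepsilon^2}\right)$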
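Table 4 Summary of the covered results in this paper for solving (84) using the primal stochastic
approach from Sect. 3.1 with the stochastic oracle satisfying (85)-(86) with $\delta = 0$. First column
contains assumptions on $f_k$, $k = 1, \ldots, m$ in addition to the convexity, $\chi = \chi(W) =
\lambda_{\max}(W)/\lambda^+_{\min}(W)$, where $\lambda_{\max}(W)$ and $\lambda^+_{\min}(W)$ are maximal and minimal positive eigenvalues of
matrix $W$. All methods except D-MASG should be applied to solve (25). The bounds from the last
two rows hold even in the case when $Q$ is unbounded, but in the expectation (see [91])

Assumptions on $f_k$ | Method | # of communication rounds | # of $\nabla f_k(x, \xi)$ oracle calls per node
$\mu$-strongly convex, $L$-smooth | D-MASG, in expectation, $Q = \mathbb{R}^n$, [41] | $\tilde O\left(\sqrt{\frac{L}{\mu}\chi}\right)$ | $\tilde O\left(\max\left\{\sqrt{\frac{L}{\mu}}, \frac{\sigma^2}{\mu\varepsilon}\right\}\right)$
$L$-smooth | SSTP_IPS with STP as a subroutine, $Q = \mathbb{R}^n$, conjecture, [32, 51] | $\tilde O\left(\sqrt{\frac{LR^2}{\varepsilon}\chi}\right)$ | $\tilde O\left(\max\left\{\sqrt{\frac{LR^2}{\varepsilon}}, \frac{\sigma^2R^2}{\varepsilon^2}\right\}\right)$
$\mu$-strongly convex, $\|\nabla f_k(x)\|_2 \le M$ | RS-Sliding, $Q$ is bounded, [32, 85, 86, 88] | $\tilde O\left(\sqrt{\frac{M^2}{\mu\varepsilon}\chi}\right)$ | $\tilde O\left(\frac{M^2 + \sigma^2}{\mu\varepsilon}\right)$
$\|\nabla f_k(x)\|_2 \le M$ | S-Sliding, $Q$ is bounded, [85, 86, 88] | $O\left(\sqrt{\frac{M^2R^2}{\varepsilon^2}\chi}\right)$ | $O\left(\frac{(M^2 + \sigma^2)R^2}{\varepsilon^2}\right)$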

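\[
= \frac{1}{m}\sum_{k=1}^{m}\max_{x_k\in\mathbb{R}^n}\left\{\langle my_k, x_k\rangle - f_k(x_k)\right\} = \frac{1}{m}\sum_{k=1}^{m}\varphi_k(my_k) = \Phi(\mathbf{y}),
\]

so $\Phi(\mathbf{y})$ is a dual function for $f(\mathbf{x})$. As for the primal approach, we are interested
in the general case when $\varphi_k(y_k) = \mathbb{E}_{\xi_k}\left[\varphi_k(y_k, \xi_k)\right]$, where $\{\xi_k\}_{k=1}^{m}$ are independent
and the stochastic gradients $\nabla\varphi_k(y_k, \xi_k)$ satisfy

\[
\left\|\mathbb{E}_{\xi_k}\left[\nabla\varphi_k(y_k, \xi_k)\right] - \nabla\varphi_k(y_k)\right\|_2 \le \delta_\varphi, \tag{90}
\]
\[
\mathbb{E}_{\xi_k}\left[\exp\left(\frac{\left\|\nabla\varphi_k(y_k, \xi_k) - \mathbb{E}_{\xi_k}\left[\nabla\varphi_k(y_k, \xi_k)\right]\right\|_2^2}{\sigma_\varphi^2}\right)\right] \le \exp(1). \tag{91}
\]

Consider the stochastic function $f_k(x_k, \xi_k)$ which is defined implicitly as follows:

\[
\varphi_k(y_k, \xi_k) = \max_{x_k\in\mathbb{R}^n}\left\{\langle y_k, x_k\rangle - f_k(x_k, \xi_k)\right\}. \tag{92}
\]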
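To make the notion of a dual-friendly function concrete, here is a small worked example under an explicit assumption: the local function is taken to be $f_k(x) = \frac12\|x - b\|_2^2$, which is not from the source and is chosen only because its conjugate (88) has a closed form. The maximizer is $x(y) = y + b$, and $\varphi_k(y) = \langle y, b\rangle + \frac12\|y\|_2^2$, so that $\nabla\varphi_k(y) = x(y)$.

```python
import numpy as np

def primal(x, b):
    """Toy dual-friendly function f_k(x) = 0.5 * ||x - b||^2 (an assumption)."""
    return 0.5 * float(np.sum((x - b) ** 2))

def x_of_y(y, b):
    """Maximizer of <y, x> - f_k(x): the condition y - (x - b) = 0 gives x = y + b."""
    return y + b

def conjugate(y, b):
    """phi_k(y) = <y, x(y)> - f_k(x(y)) = <y, b> + 0.5 * ||y||^2."""
    return float(y @ b + 0.5 * y @ y)

def numerical_gradient(func, y, h=1e-5):
    """Central finite differences, used only to check grad phi_k(y) = x(y)."""
    grad = np.zeros_like(y)
    for i in range(len(y)):
        e = np.zeros_like(y)
        e[i] = h
        grad[i] = (func(y + e) - func(y - e)) / (2.0 * h)
    return grad

b = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, 0.7, -1.1])
x_star = x_of_y(y, b)
grad_phi = numerical_gradient(lambda v: conjugate(v, b), y)
```

The identity $\nabla\varphi_k(y) = x(y)$ is what allows the dual methods of Sect. 3.2 to recover primal variables from dual iterates.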


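Since

\[
\nabla\Phi(\mathbf{y}) = \sum_{k=1}^{m}\nabla\varphi_k(my_k) = \sum_{k=1}^{m}x_k(my_k) \overset{\text{def}}{=} x(\mathbf{y}), \qquad x_k(y_k) \overset{\text{def}}{=} \operatorname*{argmax}_{x_k\in\mathbb{R}^n}\left\{\langle y_k, x_k\rangle - f_k(x_k)\right\},
\]

it is natural to define the stochastic gradient $\nabla\Phi(\mathbf{y}, \xi)$ as follows:

\[
\nabla\Phi(\mathbf{y}, \xi) \overset{\text{def}}{=} \nabla\Phi(\mathbf{y}, \{\xi_k\}_{k=1}^{m}) \overset{\text{def}}{=} \sum_{k=1}^{m}\nabla\varphi_k(my_k, \xi_k) = \sum_{k=1}^{m}x_k(my_k, \xi_k) \overset{\text{def}}{=} x(\mathbf{y}, \xi),
\]
\[
x_k(y_k, \xi_k) \overset{\text{def}}{=} \operatorname*{argmax}_{x_k\in\mathbb{R}^n}\left\{\langle y_k, x_k\rangle - f_k(x_k, \xi_k)\right\}.
\]

It satisfies (see also (43))

\[
\left\|\mathbb{E}_{\xi}\left[\nabla\Phi(\mathbf{y}, \xi)\right] - \nabla\Phi(\mathbf{y})\right\|_2 \le \delta_\Phi,
\]
\[
\mathbb{E}_{\xi}\left[\exp\left(\frac{\left\|\nabla\Phi(\mathbf{y}, \xi) - \mathbb{E}_{\xi}\left[\nabla\Phi(\mathbf{y}, \xi)\right]\right\|_2^2}{\sigma_\Phi^2}\right)\right] \le \exp(1)
\]

with $\delta_\Phi = m\delta_\varphi$ and $\sigma_\Phi^2 = O\left(m\sigma_\varphi^2\right)$. Using this, we define the stochastic gradient of
$\Psi(\mathbf{y})$ as $\nabla\Psi(\mathbf{y}, \xi) \overset{\text{def}}{=} \sqrt{W}\nabla\Phi(\sqrt{W}\mathbf{y}, \xi) = \sqrt{W}x(\sqrt{W}\mathbf{y}, \xi)$ and, as a consequence,
we get

\[
\left\|\mathbb{E}_{\xi}\left[\nabla\Psi(\mathbf{y}, \xi)\right] - \nabla\Psi(\mathbf{y})\right\|_2 \le \delta_\Psi,
\]
\[
\mathbb{E}_{\xi}\left[\exp\left(\frac{\left\|\nabla\Psi(\mathbf{y}, \xi) - \mathbb{E}_{\xi}\left[\nabla\Psi(\mathbf{y}, \xi)\right]\right\|_2^2}{\sigma_\Psi^2}\right)\right] \le \exp(1)
\]

with $\delta_\Psi = \sqrt{\lambda_{\max}(W)}\,\delta_\Phi$ and $\sigma_\Psi = \sqrt{\lambda_{\max}(W)}\,\sigma_\Phi$.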

Taking all of this into account, we conclude that problem (87) is a special case of
(22) with $A = \sqrt{W}$. To make the algorithms from Sect. 3.2 distributed we should
change the variables in those methods by multiplying them by $\sqrt{W}$ from the left
[32, 34, 159], e.g., for the iterates of SPDSTM we get

\[
\tilde y^{k+1} := \sqrt{W}\tilde y^{k+1}, \qquad z^{k+1} := \sqrt{W}z^{k+1}, \qquad y^{k+1} := \sqrt{W}y^{k+1},
\]

which means that lines 4-6 of Algorithm 20 need to be multiplied by $\sqrt{W}$
from the left. After such a change of variables all methods from Sect. 3.2 become
suitable for running in a distributed fashion. Besides that, it does not spoil the
ability to recover the primal variables: before the change of variables all
of the methods mentioned in Sect. 3.2 used $x(\sqrt{W}\mathbf{y})$ or $x(\sqrt{W}\mathbf{y}, \xi)$, where the points
$\mathbf{y}$ were some dual iterates of those methods, so after the change of variables we
should use $x(\mathbf{y})$ or $x(\mathbf{y}, \xi)$, respectively. Moreover, it is also possible to compute
$\|\sqrt{W}\mathbf{x}\|_2^2 = \langle\mathbf{x}, W\mathbf{x}\rangle$ in a distributed fashion using consensus-type algorithms:
one communication step is needed to compute $W\mathbf{x}$, then each worker computes
$\langle x_k, [W\mathbf{x}]_k\rangle$ locally, and after that a consensus algorithm needs to be run. We
summarize the results for this case in Table 5. Note that the proposed bounds are
optimal in terms of the number of communication rounds up to polylogarithmic
Table 5 Summary of the covered results in this paper for solving (87) using dual stochastic
approach from Sect.3.2 with the stochastic oracle satisfying (85)-(86) with 6 = 0 for 
R-RRMA-AC-SA? and 59 = O (e/u yng) for SSTM_sc and SPDSTM. First column contains 


assumptions on f;, k = 1, ...,m in addition to the convexity, x = x(W) 
# of 
communication | # of V@;(y, &) oracle 
Assumptions on fi Method rounds calls per node 
j4-strongly convex, R-RRMA-AC- SA? O ( Ly) O (max | / Lx coe x }) 
L-smooth, (Algorithm 24), 
IV f(x) lo < M Corollary 3, 
SSTM_sc 
(Algorithm 25), 
Corollary 5 
~ ~ > 2 yy2 
jl-strongly convex, SPDSTM O ( x) O (max {V x : vom }) 
IV flo <M (Algorithm 20), 
Theorem 8 


factors [8, 135-137]. Note that the lower bounds from [135-137] are presented for 
the convolution of two criteria: number of oracle calls per node and communication 
rounds. One can obtain lower bounds for the number of communication rounds itself 
using additional assumption that time needed for one communication is big enough 
and the term which corresponds to the number of oracle calls can be neglected. 
Regarding the number of oracle calls there is a conjecture [32] that the bounds that 
we present in this paper are also optimal up to polylogarithmic factors for the class 
of methods that require optimal number of communication rounds. 
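The distributed computation of $\|\sqrt{W}\mathbf{x}\|_2^2 = \langle\mathbf{x}, W\mathbf{x}\rangle$ mentioned before Table 5 can be sketched as follows. The consensus routine below is the plain (non-accelerated) averaging iteration $v \leftarrow v - \eta\bar Wv$, chosen for brevity; the step size, the three-node path graph, and the function names are assumptions of this example, not the consensus algorithm used by the authors.

```python
import numpy as np

def local_terms(local_vectors, neighbors):
    """After one exchange with its neighbors, node k holds <x_k, [W x]_k>."""
    terms = {}
    for k, x_k in local_vectors.items():
        Wx_k = len(neighbors[k]) * x_k - sum(local_vectors[j] for j in neighbors[k])
        terms[k] = float(x_k @ Wx_k)
    return terms

def average_consensus(values, neighbors, n_rounds, step):
    """Plain consensus: every node repeatedly moves toward its neighbors' values."""
    v = dict(values)
    for _ in range(n_rounds):
        v = {k: v[k] - step * sum(v[k] - v[j] for j in neighbors[k]) for k in v}
    return v

# Path graph 0 - 1 - 2; the step must be below 2 / lambda_max(W_bar) = 2 / 3.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
local_vectors = {0: np.array([1.0, 0.0]),
                 1: np.array([0.0, 2.0]),
                 2: np.array([3.0, 1.0])}
terms = local_terms(local_vectors, neighbors)
averages = average_consensus(terms, neighbors, n_rounds=100, step=0.3)
m = len(neighbors)
estimates = {k: m * averages[k] for k in averages}  # each node's value of <x, W x>
```

Consensus preserves the average of the local terms, so multiplying the agreed value by the number of nodes $m$ recovers the sum $\langle\mathbf{x}, W\mathbf{x}\rangle$ at every node.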


3.4 Discussion 


In this section, we want to discuss some aspects of the proposed results that were
not covered in the main part of this paper. First of all, we should say that in the
smooth case for the primal approach our bounds for the number of communication
steps coincide with the optimal bounds for the number of communication steps for
parallel optimization if we substitute the diameter $d$ of the spanning tree in the
bounds for parallel optimization by $O\left(\sqrt{\chi(W)}\right)$.

However, we want to discuss another interesting difference between parallel and
decentralized optimization in terms of the complexity results, which was noticed in
[32]. From the line of works [81-83, 90] it is known that for the problem (72)+(77)
(here we use $m$ instead of $q$ and the iterator $k$ instead of $i$ for consistency) with
$L$-smooth and $\mu$-strongly convex $f_k$ for all $k = 1, \ldots, m$ the optimal number of
oracle calls, i.e., calculations of the stochastic gradients of $f_k$ with $\sigma^2$-subgaussian
variance, is

\[
\tilde O\left(m + \sqrt{m\frac{L}{\mu}} + \frac{\sigma^2}{\mu\varepsilon}\right). \tag{93}
\]


The bad news is that (93) does not work with the full parallelization trick, and the best
possible way to parallelize it is described in [90]. However, the standard accelerated
scheme with mini-batched stochastic gradients, without the variance reduction technique
and incremental oracles, gives the bound

\[
\tilde O\left(m\sqrt{\frac{L}{\mu}} + \frac{\sigma^2}{\mu\varepsilon}\right) \tag{94}
\]

for the number of oracle calls, and it admits full parallelization. It means that in the
parallel optimization setup, when we have a computational network with $m$ nodes and
a spanning tree for it with diameter $d$, the number of oracle calls per node is

\[
\tilde O\left(\sqrt{\frac{L}{\mu}} + \frac{\sigma^2}{m\mu\varepsilon}\right) = \tilde O\left(\max\left\{\sqrt{\frac{L}{\mu}}, \frac{\sigma^2}{m\mu\varepsilon}\right\}\right) \tag{95}
\]

and the number of communication steps is

\[
\tilde O\left(d\sqrt{\frac{L}{\mu}}\right). \tag{96}
\]

However, for the decentralized setup the second row of Table 4 states that the
number of communication rounds is the same as in (96) up to substitution of $d$
by $\sqrt{\chi(W)}$ and the number of oracle calls per node is

\[
\tilde O\left(\max\left\{\sqrt{\frac{L}{\mu}}, \frac{\sigma^2}{\mu\varepsilon}\right\}\right), \tag{97}
\]


which has an $m$ times bigger statistical term under the maximum than in (95). What is
more, it was recently shown that there exists a decentralized distributed method
that requires

\[
\tilde O\left(\frac{\sigma^2}{m\mu\varepsilon}\right)
\]

stochastic gradient oracle calls per node [121, 122], but it is not optimal in terms
of the number of communications. Recently, a stochastic optimization method with a
consensus subroutine for time-varying graphs requiring $O\left(\sigma^2/(m\mu\varepsilon)\right)$ oracle calls
and $\tilde O\left(\sqrt{L/\mu}\,\chi\right)$ communications was proposed in [131]. The results of [131] can
be easily extended to $\tilde O\left(\sqrt{L/\mu}\,\sqrt{\chi}\right)$ communication complexity in the time-static
case by employing accelerated consensus with Chebyshev acceleration. Moreover,
there is a hypothesis [32] that in the smooth case the bounds from Tables 3 and 4
(rows 2 and 3) are not optimal in terms of the number of oracle calls per node and
the optimal ones can be found in Table 2.


3.5 Application for Population Wasserstein Barycenter Calculation


In this section we consider the problem of calculation of population Wasserstein 
barycenter since this example hides different interesting details connected with the 
theory discussed in this paper. In our presentation of this example we rely mostly 
on the recent works [30, 31]. 


3.5.1 Definitions and Properties 


We define the probability simplex in $\mathbb{R}^n$ as $S_n(1) = \{x \in \mathbb{R}^n_+ \mid \sum_{i=1}^n x_i = 1\}$.
One can interpret the elements of $S_n(1)$ as discrete probability measures with $n$
shared atoms. For an arbitrary pair of measures $p, q \in S_n(1)$ we introduce the
set $\Pi(p,q) = \{\pi \in \mathbb{R}^{n\times n}_+ \mid \pi\mathbf{1} = p,\ \pi^\top\mathbf{1} = q\}$ called the transportation polytope.
The optimal transportation (OT) problem between measures $p, q \in S_n(1)$ is defined
as follows:

$$\mathcal{W}(p,q) = \min_{\pi\in\Pi(p,q)} \langle C, \pi\rangle = \min_{\pi\in\Pi(p,q)} \sum_{i,j=1}^n C_{ij}\pi_{ij}, \tag{98}$$

where $C$ is a transportation cost matrix. That is, the $(i,j)$-th component $C_{ij}$ of $C$ is the
cost of transportation of the unit mass from point $x_i$ to the point $x_j$, where the points are
atoms of measures from $S_n(1)$.

Next, we consider the entropic OT problem (see [123, 127])

$$\mathcal{W}_\mu(p,q) = \min_{\pi\in\Pi(p,q)} \sum_{i,j=1}^n \left(C_{ij}\pi_{ij} + \mu\,\pi_{ij}\ln\pi_{ij}\right). \tag{99}$$

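To make the entropic problem (99) concrete, here is a minimal pure-Python sketch of the classical Sinkhorn scaling iteration, which alternately rescales the rows and columns of the kernel $\exp(-C_{ij}/\mu)$. It is an illustration under stated assumptions and not the specific Sinkhorn variants analyzed in the cited works; the function names, the fixed iteration count, and the tiny 2-by-2 instance are hypothetical.

```python
import math

def sinkhorn(C, p, q, mu, iters=500):
    """Approximate the minimizer of (99) by Sinkhorn row/column scaling.

    The optimal plan has the form u_i * K_ij * v_j with K_ij = exp(-C_ij / mu);
    the iteration count is an illustrative choice, not a tuned stopping rule.
    """
    n = len(p)
    K = [[math.exp(-C[i][j] / mu) for j in range(n)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * n
    for _ in range(iters):
        # Rescale rows so that the plan's row sums match p.
        u = [p[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        # Rescale columns so that the plan's column sums match q.
        v = [q[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]

def entropic_cost(C, plan, mu):
    """Objective of (99) evaluated at a given transportation plan."""
    total = 0.0
    for cost_row, plan_row in zip(C, plan):
        for c, x in zip(cost_row, plan_row):
            if x > 0.0:
                total += c * x + mu * x * math.log(x)
    return total

# Hypothetical toy instance with two atoms.
C = [[0.0, 1.0], [1.0, 0.0]]
p = [0.5, 0.5]
q = [0.25, 0.75]
plan = sinkhorn(C, p, q, mu=0.5)
value = entropic_cost(C, plan, mu=0.5)
```

Smaller values of $\mu$ bring the plan closer to the unregularized problem (98) but slow the scaling iteration down, which is the trade-off that reappears in the complexity bounds of this section.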
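Consider some probability measure $\mathbb{P}$ on $S_n(1)$. Then one can define the population
barycenter of measures from $S_n(1)$ as

$$p^*_\mu = \operatorname*{argmin}_{p\in S_n(1)} \int_{q\in S_n(1)} \mathcal{W}_\mu(p,q)\,d\mathbb{P}(q) = \operatorname*{argmin}_{p\in S_n(1)} \underbrace{\mathbb{E}_q\left[\mathcal{W}_\mu(p,q)\right]}_{\mathcal{W}_\mu(p)}. \tag{100}$$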


For a given set of samples $q^1,\dots,q^m$ we introduce the empirical barycenter as

$$\hat p^*_\mu = \operatorname*{argmin}_{p\in S_n(1)} \underbrace{\frac{1}{m}\sum_{i=1}^m \mathcal{W}_\mu(p,q^i)}_{\hat{\mathcal{W}}_\mu(p)}. \tag{101}$$


We consider the problem (100) of finding population barycenter with some accuracy 
and discuss possible approaches to solve this problem in the following subsections. 

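However, before that, we need to mention some useful properties of $\mathcal{W}_\mu(p,q)$.
First of all, one can write explicitly the dual function of $\mathcal{W}_\mu(p,q)$ for a fixed $q \in S_n(1)$
(see [25, 30]):

$$\mathcal{W}_\mu(p,q) = \max_{\lambda\in\mathbb{R}^n}\left\{\langle\lambda,p\rangle - \mathcal{W}^*_{q,\mu}(\lambda)\right\}, \tag{102}$$

$$\mathcal{W}^*_{q,\mu}(\lambda) = \mu\sum_{j=1}^n q_j \ln\left(\frac{1}{q_j}\sum_{i=1}^n \exp\left(\frac{-C_{ij}+\lambda_i}{\mu}\right)\right). \tag{103}$$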


Using this representation one can deduce the following theorem. 


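Theorem 15 ([30]) For an arbitrary $q \in S_n(1)$ the entropic Wasserstein distance
$\mathcal{W}_\mu(\cdot,q) : S_n(1) \to \mathbb{R}$ is $\mu$-strongly convex w.r.t. $\ell_2$-norm and $M$-Lipschitz
continuous w.r.t. $\ell_2$-norm. Moreover, $M \le \sqrt{n}M_\infty$ where $M_\infty$ is the Lipschitz constant
of $\mathcal{W}_\mu(\cdot,q)$ w.r.t. $\ell_\infty$-norm and$^5$ $M_\infty = \tilde O(\|C\|_\infty)$.

We also want to notice that the function $\mathcal{W}^*_{q,\mu}(\lambda)$ is only strictly convex and the
minimal eigenvalue of its Hessian $\gamma = \lambda_{\min}(\nabla^2\mathcal{W}^*_{q,\mu}(\lambda^*))$ evaluated at the solution
$\lambda^* \overset{\text{def}}{=} \operatorname{argmax}_{\lambda\in\mathbb{R}^n}\{\langle\lambda,p\rangle - \mathcal{W}^*_{q,\mu}(\lambda)\}$ is very small: the only known
bounds for it are exponentially small in $n$.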

We will also use another useful relation (see [30]):

$$\nabla\mathcal{W}_\mu(p,q) = \lambda^*, \quad \langle\lambda^*,\mathbf{1}\rangle = 0, \tag{104}$$

where the gradient $\nabla\mathcal{W}_\mu(p,q)$ is taken w.r.t. the first argument.

5 Under the assumption that measures are separated from zero, see the details in [21] and the
proof of Proposition 2.5 from [30].


3.5.2 SA Approach


Assume that one can obtain and use fresh samples $q^1, q^2, \dots$ in online regime.
This approach is called Stochastic Approximation (SA). It implies that at each
iteration one can draw a fresh sample $q^k$ and compute the gradient w.r.t. $p$ of the
function $\mathcal{W}_\mu(p,q^k)$, which is $\mu$-strongly convex and $M$-Lipschitz continuous with
$M = O(\sqrt{n}\|C\|_\infty)$. Optimal methods for this case are based on iterations of the
following form:

$$p^{k+1} = \operatorname{proj}_{S_n(1)}\left(p^k - \eta_k\nabla\mathcal{W}_\mu(p^k,q^k)\right),$$


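where $\operatorname{proj}_{S_n(1)}(x)$ is the projection of $x \in \mathbb{R}^n$ on $S_n(1)$ and the gradient $\nabla\mathcal{W}_\mu(p^k,q^k)$
is taken w.r.t. the first argument. One can show that restarted-SGD (R-SGD)
from [67] that uses biased stochastic gradients (see also [30, 46, 65]) $\tilde\nabla\mathcal{W}_\mu(p,q)$
such that

$$\left\|\tilde\nabla\mathcal{W}_\mu(p,q) - \nabla\mathcal{W}_\mu(p,q)\right\|_2 \le \delta \tag{105}$$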


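for some $\delta \ge 0$ and for all $p, q \in S_n(1)$ after $N$ calls of this oracle produces such a
point $p^N$ that with probability at least $1-\beta$ the following inequalities hold:

$$\mathcal{W}_\mu(p^N) - \mathcal{W}_\mu(p^*_\mu) = O\left(\frac{n\|C\|_\infty^2\ln(N/\beta)}{\mu N} + \delta\right) \tag{106}$$

and, as a consequence of $\mu$-strong convexity of $\mathcal{W}_\mu(p,q)$ for all $q$,

$$\left\|p^N - p^*_\mu\right\|_2 = O\left(\sqrt{\frac{n\|C\|_\infty^2\ln(N/\beta)}{\mu^2 N} + \frac{\delta}{\mu}}\right). \tag{107}$$

That is, to guarantee

$$\left\|p^N - p^*_\mu\right\|_2 \le \varepsilon \tag{108}$$

with probability at least $1-\beta$, R-SGD requires

$$\tilde O\left(\frac{n\|C\|_\infty^2}{\mu^2\varepsilon^2}\right) \quad \tilde\nabla\mathcal{W}_\mu(p,q) \text{ oracle calls} \tag{109}$$

under the additional assumption that $\delta = O(\mu\varepsilon^2)$.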
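The projected iteration above needs a Euclidean projection onto $S_n(1)$. The text does not say how the projection is computed; the sketch below assumes the standard sort-based routine and is meant only as an illustration. The function names and the step-size argument are hypothetical, and the gradient is passed in as a plain list, so any (possibly inexact) gradient oracle can be plugged in.

```python
def project_to_simplex(x):
    """Euclidean projection of x onto {p >= 0, sum(p) = 1} via sorting."""
    u = sorted(x, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        # The shift is fixed by the largest k whose entry stays positive.
        if u_k - candidate > 0.0:
            theta = candidate
    return [max(x_i - theta, 0.0) for x_i in x]

def projected_sgd_step(p, grad, step):
    """One iteration p_next = proj(p - step * grad) of the SA scheme."""
    return project_to_simplex([p_i - step * g_i for p_i, g_i in zip(p, grad)])

# Hypothetical usage: one step from the uniform measure on three atoms.
p_next = projected_sgd_step([1 / 3, 1 / 3, 1 / 3], [0.4, -0.1, 0.2], step=0.5)
```

Sorting makes each projection cost $O(n\log n)$ operations, which is small compared with the cost of computing an approximate gradient of $\mathcal{W}_\mu$.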

However, it is a computationally hard problem to find $\nabla\mathcal{W}_\mu(p,q)$ with high
accuracy, i.e., to find $\tilde\nabla\mathcal{W}_\mu(p,q)$ satisfying (105) with $\delta = O(\varepsilon^2)$. Taking into
account the relation (104) we get that it is needed to solve the problem (102) with
accuracy $\delta = O(\varepsilon^2)$ in terms of the distance to the optimum. That is, it is needed to
find such $\tilde\lambda$ that $\|\tilde\lambda - \lambda^*\|_2 \le \delta$ and set $\tilde\nabla\mathcal{W}_\mu(p,q) = \tilde\lambda$. Using variants of the Sinkhorn
algorithm [58, 80, 150] one can show [30] that R-SGD finds a point $p^N$ such that
(108) holds with probability at least $1-\beta$ and it requires

$$\tilde O\left(\frac{n^3\|C\|_\infty^2}{\mu^2\varepsilon^2}\min\left\{\exp\left(\frac{\|C\|_\infty}{\mu}\right)\left(\frac{\|C\|_\infty}{\mu} + \ln\left(\frac{\|C\|_\infty}{\gamma\mu^2\varepsilon^2}\right)\right),\ \sqrt{\frac{n}{\gamma\mu^3\varepsilon^2}}\right\}\right) \tag{110}$$

arithmetical operations.


3.5.3 SAA Approach


Now let us assume that a large enough collection of samples $q^1,\dots,q^m$ is available.


Our goal is to find such $\hat p \in S_n(1)$ that $\|\hat p - p^*_\mu\|_2 \le \varepsilon$ with high probability,
i.e., an $\varepsilon$-approximation of the population barycenter, via solving the empirical
barycenter problem (101). This approach is called Stochastic Average Approximation
(SAA). Since $\mathcal{W}_\mu(p,q^i)$ is $\mu$-strongly convex and $M$-Lipschitz in $p$ with
$M = O(\sqrt{n}\|C\|_\infty)$ for all $i = 1,\dots,m$ we can conclude that with probability $\ge 1-\beta$

$$\mathcal{W}_\mu(\hat p^*_\mu) - \mathcal{W}_\mu(p^*_\mu) = O\left(\frac{n\|C\|_\infty^2\ln(m)\ln(m/\beta)}{\mu m} + \sqrt{\frac{n\|C\|_\infty^2\ln(1/\beta)}{m}}\right), \tag{111}$$


where we use that the diameter of $S_n(1)$ is $O(1)$. Moreover, in [140] it was shown
that one can guarantee that with probability $\ge 1-\beta$

$$\mathcal{W}_\mu(\hat p^*_\mu) - \mathcal{W}_\mu(p^*_\mu) = O\left(\frac{n\|C\|_\infty^2}{\beta\mu m}\right). \tag{112}$$


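Taking advantage of both inequalities we get that if

$$m = \tilde\Omega\left(\min\left\{\max\left\{\frac{n\|C\|_\infty^2}{\mu^2\varepsilon^2},\ \frac{n\|C\|_\infty^2}{\mu^2\varepsilon^4}\right\},\ \frac{n\|C\|_\infty^2}{\beta\mu^2\varepsilon^2}\right\}\right) = \tilde\Omega\left(n\min\left\{\frac{\|C\|_\infty^2}{\mu^2\varepsilon^4},\ \frac{\|C\|_\infty^2}{\beta\mu^2\varepsilon^2}\right\}\right) \tag{113}$$

then with probability at least $1-\beta/2$

$$\left\|\hat p^*_\mu - p^*_\mu\right\|_2 \le \sqrt{\frac{2}{\mu}\left(\mathcal{W}_\mu(\hat p^*_\mu) - \mathcal{W}_\mu(p^*_\mu)\right)} \overset{(111),(112),(113)}{\le} \frac{\varepsilon}{2}. \tag{114}$$

Assuming that we have such $\hat p \in S_n(1)$ that with probability at least $1-\beta/2$ the
inequality

$$\left\|\hat p - \hat p^*_\mu\right\|_2 \le \frac{\varepsilon}{2} \tag{115}$$

holds, we apply the union bound and get that with probability $\ge 1-\beta$

$$\left\|\hat p - p^*_\mu\right\|_2 \le \left\|\hat p - \hat p^*_\mu\right\|_2 + \left\|\hat p^*_\mu - p^*_\mu\right\|_2 \le \varepsilon. \tag{116}$$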


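It remains to describe the approach that finds such $\hat p \in S_n(1)$ that satisfies
(116) with probability at least $1-\beta$. Recall that in this subsection we consider
the following problem:

$$\hat{\mathcal{W}}_\mu(p) = \frac{1}{m}\sum_{i=1}^m \mathcal{W}_\mu(p,q^i) \to \min_{p\in S_n(1)}. \tag{117}$$

For each summand $\mathcal{W}_\mu(p,q^i)$ in the sum above we have the explicit formula
(103) for the dual function $\mathcal{W}^*_{q^i,\mu}(\lambda)$. Note that one can compute the gradient of
$\mathcal{W}^*_{q^i,\mu}(\lambda)$ via $O(n^2)$ arithmetical operations. What is more, $\mathcal{W}^*_{q^i,\mu}(\lambda)$ has a finite-sum
structure, so one can sample the $j$-th component of $q^i$ with probability $q^i_j$ and get the
stochastic gradient

$$\left[\nabla\mathcal{W}^*_{q^i,\mu}(\lambda,j)\right]_l = \frac{\exp\left((-C_{lj}+\lambda_l)/\mu\right)}{\sum_{\ell=1}^n\exp\left((-C_{\ell j}+\lambda_\ell)/\mu\right)}, \quad l = 1,\dots,n, \tag{118}$$

which requires $O(n)$ arithmetical operations to be computed.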
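The relation between the dual function (103), its full gradient, and the sampled gradient (118) can be sketched in a few lines of pure Python. Differentiating (103) gives a $q$-weighted average of column-wise softmax vectors, so sampling the column index $j$ with probability $q_j$ yields an unbiased estimate at $O(n)$ cost. The function names and the toy data are hypothetical; this is an illustration, not the implementation used in the cited works.

```python
import math
import random

def dual_value(C, lam, q, mu):
    """Dual function (103) for a fixed measure q."""
    n = len(lam)
    value = 0.0
    for j in range(n):
        if q[j] > 0.0:
            inner = sum(math.exp((-C[i][j] + lam[i]) / mu) for i in range(n))
            value += mu * q[j] * math.log(inner / q[j])
    return value

def stochastic_dual_gradient(C, lam, j, mu):
    """Sampled gradient (118): softmax over l of (-C[l][j] + lam[l]) / mu."""
    scores = [(-C[l][j] + lam[l]) / mu for l in range(len(lam))]
    shift = max(scores)  # subtracting the max avoids overflow in exp
    weights = [math.exp(s - shift) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def full_dual_gradient(C, lam, q, mu):
    """Full gradient of (103): q-weighted average of the n sampled gradients."""
    n = len(lam)
    grad = [0.0] * n
    for j in range(n):
        column = stochastic_dual_gradient(C, lam, j, mu)
        for l in range(n):
            grad[l] += q[j] * column[l]
    return grad

def sample_stochastic_gradient(C, lam, q, mu, rng):
    """Draw j with probability q_j and return the O(n) estimate."""
    j = rng.choices(range(len(q)), weights=q)[0]
    return stochastic_dual_gradient(C, lam, j, mu)

# Hypothetical toy data with three atoms.
C = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
q = [0.2, 0.5, 0.3]
lam = [0.1, -0.2, 0.1]
grad = full_dual_gradient(C, lam, q, mu=0.7)
estimate = sample_stochastic_gradient(C, lam, q, 0.7, random.Random(0))
```

Both the full and the sampled gradient lie in the probability simplex, which reflects that the dual gradient recovers a marginal of the transportation plan.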

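We start with the simple situation. Assume that the measures $q^i$ are stored on
$m$ separate machines, one measure per machine, that form some network with Laplacian
matrix $\bar W \in \mathbb{R}^{m\times m}$. For this scenario we can apply the dual approach described in
Sect. 3.3 and apply bounds from Table 5. If for all $i = 1,\dots,m$ the $i$-th node computes
the full gradient of the dual function $\mathcal{W}^*_{q^i,\mu}$ at each iteration then in order to find such
a point $\hat p$ that with probability at least $1-\beta/2$

$$\hat{\mathcal{W}}_\mu(\hat p) - \hat{\mathcal{W}}_\mu(\hat p^*_\mu) \le \hat\varepsilon, \tag{119}$$

where $W = \bar W \otimes I_n$, this approach requires $\tilde O\left(\sqrt{\frac{n\|C\|_\infty^2}{\mu\hat\varepsilon}\chi(W)}\right)$ communication
rounds and $\tilde O\left(n^{2.5}\sqrt{\frac{\|C\|_\infty^2}{\mu\hat\varepsilon}\chi(W)}\right)$ arithmetical operations per node to find
gradients $\nabla\mathcal{W}^*_{q^i,\mu}(\lambda)$. If instead of full gradients workers use stochastic gradients
$\nabla\mathcal{W}^*_{q^i,\mu}(\lambda,j)$ defined in (118) and these stochastic gradients have light-tailed
distribution, i.e., satisfy the condition (91) with parameter $\sigma > 0$, then to guarantee
(119) with probability $\ge 1-\beta$ the aforementioned approach needs the same number
of communication rounds and

$$\tilde O\left(n\max\left\{\sqrt{\frac{n\|C\|_\infty^2}{\mu\hat\varepsilon}\chi(W)},\ \frac{\sigma^2 n\|C\|_\infty^2}{\hat\varepsilon^2}\chi(W)\right\}\right)$$

arithmetical operations per node to find gradients $\nabla\mathcal{W}^*_{q^i,\mu}(\lambda,j)$. Using $\mu$-strong
convexity of $\mathcal{W}_\mu(p,q^i)$ for all $i = 1,\dots,m$ and taking $\hat\varepsilon = \frac{\mu\varepsilon^2}{8}$ we get that our
approach finds such a point $\hat p$ that satisfies (115) with probability at least $1-\beta/2$
using

$$\tilde O\left(\frac{\sqrt{n}\|C\|_\infty}{\mu\varepsilon}\sqrt{\chi(W)}\right) \quad \text{communication rounds} \tag{120}$$

and

$$\tilde O\left(\frac{n^{2.5}\|C\|_\infty}{\mu\varepsilon}\sqrt{\chi(W)}\right) \tag{121}$$

arithmetical operations per node to find gradients in the deterministic case and

$$\tilde O\left(\max\left\{\frac{n^{1.5}\|C\|_\infty}{\mu\varepsilon}\sqrt{\chi(W)},\ \frac{\sigma^2 n^2\|C\|_\infty^2}{\mu^2\varepsilon^4}\chi(W)\right\}\right)$$

arithmetical operations per node to find stochastic gradients in the stochastic case.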
However, the state-of-the-art theory of learning states (see (113)) that $m$ should be
so large that in the stochastic case the second term in the bound for arithmetical
operations typically dominates the first term, and the dimensional dependence
reduction from $n^{2.5}$ in the deterministic case to $n^{1.5}$ in the stochastic case is
typically negligible in comparison with how much $\frac{\sigma^2 n\|C\|_\infty^2}{\mu^2\varepsilon^4}\chi(W)$ is larger than
$\frac{\sqrt{n}\|C\|_\infty}{\mu\varepsilon}\sqrt{\chi(W)}$. That is, our theory says that it is better to use full gradients in the
particular example considered in this section (see also Sect. 3.4). Therefore, further
in the section we will assume that $\sigma^2 = 0$, i.e., workers use full gradients of the dual
functions $\mathcal{W}^*_{q^i,\mu}(\lambda)$.

However, bounds (120)-(121) were obtained under the assumption that we have $m$
workers and each worker stores only one measure, which at first sight is very
restrictive and unrealistic. One can relax this assumption in the following way. Assume
that we have $l < m$ machines connected in a network with Laplacian matrix $\hat W$ and the
$j$-th machine stores $m_j \ge 1$ measures for $j = 1,\dots,l$ and $\sum_{j=1}^l m_j = m$. Next,
for the $j$-th machine we introduce $m_j$ virtual workers, also connected in some network
that the $j$-th machine can emulate along with communication between virtual workers,
and to every virtual worker we assign one measure. For example, this can be implemented
as an array-like data structure with some formal rules for exchanging the data between
cells that emulate communications. We also assume that inside the machine we
can set the preferable network for the virtual nodes in such a way that each machine
emulates communication between virtual nodes and computations inside them fast
enough. Let us denote the Laplacian matrix of the obtained network of $m$ virtual
nodes as $\bar W$. Then, our approach finds such a point $\hat p$ that satisfies (115) with
probability at least $1-\beta/2$ using

$$\tilde O\left(\underbrace{\left(\max_{j=1,\dots,l}\tau_{cm,j}\right)}_{\tau_{cm,\max}}\frac{\sqrt{n}\|C\|_\infty}{\mu\varepsilon}\sqrt{\chi(W)}\right) \tag{122}$$

time to perform communications and

$$\tilde O\left(\underbrace{\left(\max_{j=1,\dots,l}\tau_{cp,j}\right)}_{\tau_{cp,\max}}\frac{n^{2.5}\|C\|_\infty}{\mu\varepsilon}\sqrt{\chi(W)}\right) \tag{123}$$

time for arithmetical operations per machine to find gradients, where $\tau_{cm,j}$ is the
time needed for the $j$-th machine to emulate communication between the corresponding
virtual nodes at each iteration and $\tau_{cp,j}$ is the time required by the $j$-th machine to
perform 1 arithmetical operation for all corresponding virtual nodes in the gradient
computation process at each iteration. For example, if we have only one machine and
the network of virtual nodes forms a complete graph then $\chi(W) = 1$, but $\tau_{cm,\max}$ and
$\tau_{cp,\max}$ can be large and to reduce the running time one should use a more powerful
machine. In contrast, if we have $m$ machines connected in a star graph then $\tau_{cm,\max}$
and $\tau_{cp,\max}$ will be much smaller, but $\chi(W)$ will be of order $m$, which is large.
Therefore, it is very important to choose a balanced architecture of the network, at
least for virtual nodes per machine if it is possible. This question requires a separate
thorough study and lies out of the scope of this paper.


3.5.4 SA vs SAA Comparison 


Recall that in SA approach we assume that it is possible to sample new measures 
in online regime which means that the computational process is performed on one 
machine, whereas in SAA approach we assume that large enough collection of mea- 
sures is distributed among the network of machines that form some computational 
network. In practice measures from $S_n(1)$ correspond to some images. As one can
see from the complexity bounds, both SA and SAA approaches require large number 
of samples to learn the population barycenter defined in (100). If these samples are 
images, then they typically cannot be stored in RAM of one computer. Therefore, it 
is natural to use distributed systems to store the data. 

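Now let us compare complexity bounds for SA and SAA. We summarize them
in Table 6. When the communication is fast enough and $\mu$ is small we typically
have that the SAA approach significantly outperforms the SA approach in terms of the
complexity as well, even for communication architectures with big $\chi(W)$. Therefore,
for a balanced architecture one can expect that the SAA approach will outperform SA
even more.

Table 6 Complexity bounds for SA and SAA approaches for computation of the population
barycenter defined in (100) with accuracy $\varepsilon$. The third row states the complexity bound for
the SA approach when the second term under the minimum in (110) is dominated by the first
one, e.g., when $\mu$ is small enough. The last row corresponds to the case when
$\tau_{cm,\max} = O(1)$, $\tau_{cp,\max} = O(1)$, $\sqrt{\beta} \ge \varepsilon$, e.g., $\beta = 0.01$ and $\varepsilon \le 0.1$, and the communication
network is star-like, which implies $\chi(W) = \Omega(m)$

Approach                       Complexity

SA                             $\tilde O\left(\frac{n^3\|C\|_\infty^2}{\mu^2\varepsilon^2}\min\left\{\exp\left(\frac{\|C\|_\infty}{\mu}\right)\left(\frac{\|C\|_\infty}{\mu} + \ln\left(\frac{\|C\|_\infty}{\gamma\mu^2\varepsilon^2}\right)\right),\ \sqrt{\frac{n}{\gamma\mu^3\varepsilon^2}}\right\}\right)$
                               arithmetical operations

SA, the 2-d term is smaller    $\tilde O\left(\frac{n^{3.5}\|C\|_\infty^2}{\sqrt{\gamma}\,\mu^{3.5}\varepsilon^3}\right)$ arithmetical operations

SAA                            $\tilde O\left(\tau_{cm,\max}\frac{\sqrt{n}\|C\|_\infty}{\mu\varepsilon}\sqrt{\chi(W)}\right)$ time to perform communications,
                               $\tilde O\left(\tau_{cp,\max}\frac{n^{2.5}\|C\|_\infty}{\mu\varepsilon}\sqrt{\chi(W)}\right)$ time for arithmetical operations per
                               machine, where $m = \tilde\Omega\left(n\min\left\{\frac{\|C\|_\infty^2}{\mu^2\varepsilon^4},\ \frac{\|C\|_\infty^2}{\beta\mu^2\varepsilon^2}\right\}\right)$

SAA, $\chi(W) = \Omega(m)$,        $\tilde O\left(\frac{n\|C\|_\infty^2}{\sqrt{\beta}\mu^2\varepsilon^2}\right)$ communication rounds,
$\tau_{cm,\max} = O(1)$,           $\tilde O\left(\frac{n^3\|C\|_\infty^2}{\sqrt{\beta}\mu^2\varepsilon^2}\right)$ arithmetical operations per machine
$\tau_{cp,\max} = O(1)$,
$\sqrt{\beta} \ge \varepsilon$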

To conclude, we state that population barycenter computation is a natural 
example when it is typically much more preferable to use distributed algorithms 
with dual oracle instead of SA approach in terms of memory and complexity bounds. 


4 Derivative-Free Distributed Optimization 


As mentioned above in Sect. 3, the decentralized optimization problem can be
rewritten as a problem with affine constraints:

$$\min_{\sqrt{W}\mathbf{x} = 0,\ x_1,\dots,x_m\in Q} f(\mathbf{x}) = \frac{1}{m}\sum_{i=1}^m f_i(x_i), \tag{124}$$

where we use the matrix $W \overset{\text{def}}{=} \bar W \otimes I_n$ for the Laplacian matrix $\bar W = \|\bar W_{ij}\|_{i,j=1}^{m,m} \in \mathbb{R}^{m\times m}$
of the connection graph. In turn, the problem with affine constraints

$$\min_{Ax = 0,\ x\in Q} f(x)$$

is rewritten in a penalized form as follows:

$$\min_{x\in Q} F(x) = f(x) + \frac{R_y^2}{\varepsilon}\|Ax\|_2^2, \tag{125}$$

with some positive constants $\varepsilon$ and $R_y$ (for details see Sect. 3). As a result, we have
a classical composite optimization problem; therefore, this section will focus on this
problem. In what follows, we will rely on work [15]. Note that the work [147] with
similar results has recently appeared (unlike work [15], it considers a more practical
one-point feedback; for a more detailed explanation of the difference, see [147]).
Note also that results of [15, 147] can be generalized for saddle-point problems by
using a proper version of the Sliding technique [89]. We will describe a method based on
the Sliding Algorithm (see [85] and Sect. 3) for the convex composite optimization
problem with smooth and non-smooth terms. One can find gradient-free methods
for distributed optimization in the literature (see [98, 154]), but the method that will
be discussed further is the first one that combines zeroth-order and first-order oracles.
It uses the first-order oracle for the smooth part and the zeroth-order oracle for the
non-smooth part.
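As a small illustration of the penalized form (125), the sketch below evaluates the consensus penalty for scalar local variables. It assumes, as a simplification, that $A = \sqrt{W}$ so that $\|Ax\|_2^2 = x^\top W x$, and it uses a path graph; the helper names, the choice of graph, and the values of $R_y$ and $\varepsilon$ are hypothetical.

```python
def path_graph_laplacian(m):
    """Laplacian matrix of a path graph on m nodes (illustrative topology)."""
    W = [[0.0] * m for _ in range(m)]
    for i in range(m - 1):
        W[i][i] += 1.0
        W[i + 1][i + 1] += 1.0
        W[i][i + 1] -= 1.0
        W[i + 1][i] -= 1.0
    return W

def consensus_penalty(W, x):
    """x^T W x, equal to ||A x||_2^2 under the assumption A = sqrt(W)."""
    m = len(x)
    return sum(x[i] * W[i][j] * x[j] for i in range(m) for j in range(m))

def penalized_objective(local_functions, W, x, R_y, eps):
    """F(x) = (1/m) * sum_i f_i(x_i) + (R_y^2 / eps) * ||A x||_2^2, cf. (125)."""
    m = len(x)
    f_value = sum(f_i(x_i) for f_i, x_i in zip(local_functions, x)) / m
    return f_value + (R_y ** 2 / eps) * consensus_penalty(W, x)

# Hypothetical example: three nodes, each holding f_i(t) = |t - i|.
W = path_graph_laplacian(3)
local_functions = [lambda t, i=i: abs(t - i) for i in range(3)]
at_consensus = penalized_objective(local_functions, W, [1.0, 1.0, 1.0], R_y=1.0, eps=0.1)
off_consensus = penalized_objective(local_functions, W, [0.0, 1.0, 3.0], R_y=1.0, eps=0.1)
```

The penalty vanishes exactly when all nodes hold the same value, so for small $\varepsilon$ the minimizer of $F$ is pushed toward consensus.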


4.1 Theoretical Part


4.1.1 Convex Case


We consider$^6$ the composite optimization problem

$$\min_{x\in Q} \Psi_0(x) = f(x) + g(x). \tag{126}$$

In this part of the paper, we will work not in the Euclidean norm $\|\cdot\|_2$, but in a certain
norm $\|\cdot\|$ (and the dual norm $\|\cdot\|_*$ for the norm $\|\cdot\|$). Also define the Bregman
divergence associated with some function $\nu(x)$, which is 1-strongly convex w.r.t.
$\|\cdot\|$-norm and differentiable on $Q$, as follows:

$$V(x,y) = \nu(y) - \nu(x) - \langle\nabla\nu(x), y - x\rangle, \quad \forall x, y \in Q.$$


The use of Bregman divergence and special norms allows taking into account the
geometric setup of the problem. For example, when we work with the problem in a
probability simplex, it seems natural to use the $\|\cdot\|_1$-norm and the Kullback-Leibler
divergence.
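The simplex example can be checked numerically. The sketch below takes $\nu$ to be the negative entropy (an assumption consistent with the entropic setup; it is 1-strongly convex w.r.t. the $\|\cdot\|_1$-norm on the simplex by Pinsker's inequality) and evaluates $V(x,y) = \nu(y) - \nu(x) - \langle\nabla\nu(x), y-x\rangle$. With this argument order the result is the Kullback-Leibler divergence of $y$ relative to $x$. All function names are hypothetical.

```python
import math

def negative_entropy(x):
    """nu(x) = sum_i x_i * ln(x_i), with the convention 0 * ln(0) = 0."""
    return sum(x_i * math.log(x_i) for x_i in x if x_i > 0.0)

def negative_entropy_gradient(x):
    """Gradient of nu at a point with strictly positive entries."""
    return [math.log(x_i) + 1.0 for x_i in x]

def bregman_divergence(nu, grad_nu, x, y):
    """V(x, y) = nu(y) - nu(x) - <grad nu(x), y - x>."""
    g = grad_nu(x)
    linear_term = sum(g_i * (y_i - x_i) for g_i, x_i, y_i in zip(g, x, y))
    return nu(y) - nu(x) - linear_term

def kl_divergence(a, b):
    """KL(a || b) = sum_i a_i * ln(a_i / b_i)."""
    return sum(a_i * math.log(a_i / b_i) for a_i, b_i in zip(a, b) if a_i > 0.0)

x = [0.2, 0.3, 0.5]
y = [0.6, 0.1, 0.3]
v_xy = bregman_divergence(negative_entropy, negative_entropy_gradient, x, y)
```

Here `v_xy` agrees with `kl_divergence(y, x)` up to rounding, because the terms $\sum_i (y_i - x_i)$ cancel on the simplex.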

Next, we introduce some assumptions for problem (126): $Q \subseteq \mathbb{R}^n$ is a compact
and convex set with diameter $D_Q$ in $\|\cdot\|$-norm, function $g$ is convex and $L$-smooth
on $Q$ w.r.t. norm $\|\cdot\|$, i.e.,

$$\|\nabla g(x) - \nabla g(y)\|_* \le L\|x - y\|, \quad \forall x, y \in Q,$$

and $f$ is a convex differentiable function on $Q$.

Assume that we have access to the first-order oracle for $g$, i.e., the gradient $\nabla g(x)$
is available, and to the biased stochastic zeroth-order oracle for $f$ (see also [19, 52])
that for a given point $x$ returns a noisy value $\tilde f(x,\xi)$ such that

$$\tilde f(x,\xi) = f(x,\xi) + \Delta(x), \tag{127}$$

where $\Delta(x)$ is a bounded noise of unknown nature

$$|\Delta(x)| \le \Delta$$

and the random variable $\xi$ is such that

$$\mathbb{E}[f(x,\xi)] = f(x).$$

Additionally, we assume that for all $x \in Q_s$ ($s \le D_Q$)

$$\|\nabla f(x,\xi)\|_2 \le M(\xi), \quad \mathbb{E}[M^2(\xi)] = M^2.$$

6 The narrative in this section follows [15].


It is important to note that for the function $f(x)$ these assumptions are made only
for theoretical estimates; we have no real access to $\nabla f(x)$. The question is how
to replace the gradient of the function $f(x)$. The easiest way is to restore the gradient
completely using finite differences:

$$\tilde f'_{full}(x,\xi) = \frac{1}{2r}\sum_{i=1}^n\left(\tilde f(x + rh_i,\xi) - \tilde f(x - rh_i,\xi)\right)h_i, \tag{128}$$

here we consider a standard orthonormal basis $\{h_1,\dots,h_n\}$. This way we
really get a vector close to the gradient. The obvious disadvantage of this method is
that one needs to call the oracle for $\tilde f(x,\xi)$ $2n$ times. Another way is to use a random
direction $e$ uniformly distributed on the Euclidean sphere (see [119, 142]):

$$\tilde f'_r(x,\xi,e) = \frac{n}{2r}\left(\tilde f(x + re,\xi) - \tilde f(x - re,\xi)\right)e. \tag{129}$$

In particular, the authors of [15] use this approximation.
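The two estimators can be compared on a toy function. The sketch below is a simplified illustration in which the oracle is noiseless and deterministic (so $\xi$ and $\Delta(x)$ play no role); the test function and all names are hypothetical. The coordinate-wise version spends $2n$ function values per gradient, the random-direction version only 2.

```python
import math
import random

def full_difference_gradient(f, x, r):
    """Central differences along each coordinate direction, as in (128)."""
    n = len(x)
    grad = [0.0] * n
    for i in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += r
        x_minus[i] -= r
        grad[i] = (f(x_plus) - f(x_minus)) / (2.0 * r)
    return grad

def random_unit_vector(n, rng):
    """Uniform direction on the Euclidean sphere via normalized Gaussians."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def random_direction_gradient(f, x, r, rng):
    """Two-point estimate (n / (2r)) * (f(x + r e) - f(x - r e)) * e, as in (129)."""
    n = len(x)
    e = random_unit_vector(n, rng)
    x_plus = [x_i + r * e_i for x_i, e_i in zip(x, e)]
    x_minus = [x_i - r * e_i for x_i, e_i in zip(x, e)]
    scale = n * (f(x_plus) - f(x_minus)) / (2.0 * r)
    return [scale * e_i for e_i in e]

def toy_function(x):
    """Hypothetical smooth test function with gradient (2*x0, 3, -1)."""
    return x[0] ** 2 + 3.0 * x[1] - x[2]

point = [1.0, 0.0, 2.0]
exact_like = full_difference_gradient(toy_function, point, r=1e-3)

rng = random.Random(0)
samples = 20000
average = [0.0, 0.0, 0.0]
for _ in range(samples):
    estimate = random_direction_gradient(toy_function, point, 1e-3, rng)
    average = [a + g / samples for a, g in zip(average, estimate)]
```

A single random-direction estimate is very noisy; only its average over many draws approaches the gradient, which is why the number of zeroth-order oracle calls grows with the dimension $n$.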

Now another problem arises: we need to combine the zeroth-order and first-order
oracles for different parts of the composite problem. It seems natural that the
gradient-free oracle should be called more often than the gradient one. The authors
of paper [15] solve this problem and propose to apply the algorithm based on Lan's
Sliding [85]. The basic idea is that we fix $\nabla g$ and iterate through the inner loop (PS
procedure), changing only the point $x$ in $\tilde f'_r(x,\xi,e)$. In Algorithm 26 we need
the following function:

$$l_g(x,y) = g(x) + \langle\nabla g(x), y - x\rangle. \tag{130}$$

It is important that the random variables $\xi_t$ are independent, and also that $e_t$ is sampled
independently from previous iterations.


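Algorithm 26 Zeroth-order sliding algorithm (zoSA)

Input: Initial point $x_0 \in Q$ and iteration limit $N$.
Let $\beta_k \in \mathbb{R}_{++}$, $\gamma_k \in \mathbb{R}_+$, and $T_k \in \mathbb{N}$, $k = 1, 2, \dots$, be given and set $\bar x_0 = x_0$.
for $k = 1, 2, \dots, N$ do
    1. Set $\underline{x}_k = (1-\gamma_k)\bar x_{k-1} + \gamma_k x_{k-1}$, and let $h_k(\cdot) = l_g(\underline{x}_k, \cdot)$ be defined in (130).
    2. Set $(x_k, \tilde x_k) = \mathrm{PS}(h_k, x_{k-1}, \beta_k, T_k)$;
    3. Set $\bar x_k = (1-\gamma_k)\bar x_{k-1} + \gamma_k\tilde x_k$.
end for
Output: $\bar x_N$.

The PS (prox-sliding) procedure.

procedure: $(x^+, \tilde x^+) = \mathrm{PS}(h, x, \beta, T)$
Let the parameters $p_t \in \mathbb{R}_{++}$ and $\theta_t \in [0,1]$, $t = 1, \dots$, be given. Set $u_0 = \tilde u_0 = x$.
for $t = 1, 2, \dots, T$ do
    $u_t = \operatorname{argmin}_{u\in Q}\left\{h(u) + \langle\tilde f'_r(u_{t-1},\xi_{t-1},e_{t-1}), u\rangle + \beta V(x,u) + \beta p_t V(u_{t-1},u)\right\}$,
    $\tilde u_t = (1-\theta_t)\tilde u_{t-1} + \theta_t u_t$.
end for
Set $x^+ = u_T$ and $\tilde x^+ = \tilde u_T$.
end procedure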

We also note that zoSA (in contrast to the basic version, Algorithm 19) takes
into account the geometric setting of the problem and uses the Bregman divergence
$V(x,y)$ instead of the standard Euclidean distance in the prox-sliding procedure.

Next, we will briefly talk about the convergence of this method (see the full
version of the analysis in [15]). First of all, we note the universal technical lemmas
that form a general approach to working with gradient-free methods for non-smooth
functions. But before that we introduce a new notation:

$$F(x) = \mathbb{E}_e[f(x + re)]. \tag{131}$$

$F(x)$ is called the smoothed function of $f(x)$. It is important to note that the function
$F(x)$ is not calculated by the algorithm; this object is needed only for theoretical
analysis. The first lemma states some properties of $F(x)$:


Lemma 5 Assume that a differentiable function $f$ defined on $Q_s$ satisfies
$\|\nabla f(x)\|_2 \le M$ with some constant $M > 0$. Then $F(x)$ defined in (131) is
convex, differentiable and $F(x)$ satisfies

$$\sup_{x\in Q}|F(x) - f(x)| \le rM, \quad \nabla F(x) = \mathbb{E}_e\left[\frac{n}{r}f(x + re)e\right], \quad \|\nabla F(x)\|_* \le \tilde c\,p_*\sqrt{n}M,$$

where $\tilde c$ is some positive constant independent of $n$ and $p_*$ is determined by the
following relation: $\sqrt{\mathbb{E}[\|e\|_*^4]} \le p_*^2$.

In other words, $F(x)$ provides a good approximation of $f(x)$ for small enough $r$.


Lemma 6 For $\tilde f'_r(x,\xi,e)$ defined in (129) the following inequalities hold:

$$\left\|\mathbb{E}[\tilde f'_r(x,\xi,e)] - \nabla F(x)\right\|_* \le \frac{n\Delta p_*}{r}, \quad \mathbb{E}\left[\|\tilde f'_r(x,\xi,e)\|_*^2\right] \le 2p_*^2\left(cnM^2 + \frac{n^2\Delta^2}{r^2}\right),$$

where $c$ is some positive constant independent of $n$.

In other words, one can consider $\tilde f'_r(x,\xi,e)$ as a biased stochastic gradient of $F(x)$
with bounded second moment. Therefore, instead of solving (126) directly one can
focus on the problem

$$\min_{x\in Q} \Psi(x) = F(x) + g(x) \tag{132}$$

with small enough $r$. As mentioned earlier, this approach is universal. In particular,
the analysis of gradient-free methods for non-smooth saddle-point problems can be
carried out in a similar way [19].

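Now we will give the main facts from [15] for the zoSA algorithm itself. The
following theorem states convergence guarantees:

Theorem 16 Suppose that $\{p_t\}_{t\ge 1}$, $\{\theta_t\}_{t\ge 1}$ are

$$p_t = \frac{t}{2}, \quad \theta_t = \frac{2(t+1)}{t(t+3)} \quad \text{for all } t \ge 1, \tag{133}$$

$N$ is given, $\{\beta_k\}$, $\{\gamma_k\}$, $\{T_k\}$ are

$$\beta_k = \frac{2L}{k}, \quad \gamma_k = \frac{2}{k+1}, \quad T_k = \left\lceil\frac{CNp_*^2\left(nM^2 + \frac{n^2\Delta^2}{r^2}\right)k^2}{\tilde D L^2}\right\rceil \tag{134}$$

with $\tilde D = 3D_{Q,V}^2/4$, $D_{Q,V} = \max\{\sqrt{2V(x,y)} \mid x, y \in Q\}$, $D_Q = \max\{\|x - y\| \mid x, y \in Q\}$,
and some positive constant $C$. Then for all $N \ge 1$

$$\mathbb{E}[\Psi(\bar x_N) - \Psi(x^*)] \le \frac{12LD_{Q,V}^2}{N(N+1)} + \frac{n\Delta D_Q p_*}{r}.$$

Finally, we need to connect the result above to the initial problem (126).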
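Corollary 6 Under the assumptions of Theorem 16 we have that the following
inequality holds for all $N \ge 1$:

$$\mathbb{E}[\Psi_0(\bar x_N) - \Psi_0(x^*)] \le 2rM + \frac{12LD_{Q,V}^2}{N(N+1)} + \frac{n\Delta D_Q p_*}{r}. \tag{135}$$

From (135) it follows that if

$$r = \Theta\left(\frac{\varepsilon}{M}\right), \quad \Delta = O\left(\frac{\varepsilon^2}{nMD_Q\min\{p_*,1\}}\right)$$

and $\varepsilon = O(\sqrt{n}MD_Q)$, then the number of evaluations for $\nabla g$ and $\tilde f$, respectively,
required by Algorithm 26 to find an $\varepsilon$-solution of (126), i.e., such $\bar x_N$ that
$\mathbb{E}[\Psi_0(\bar x_N)] - \Psi_0(x^*) \le \varepsilon$, can be bounded by

$$O\left(\sqrt{\frac{LD_{Q,V}^2}{\varepsilon}}\right) \quad \text{and} \quad O\left(\sqrt{\frac{LD_{Q,V}^2}{\varepsilon}} + \frac{D_{Q,V}^2p_*^2nM^2}{\varepsilon^2}\right). \tag{136}$$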


It is interesting to analyze the obtained results depending on $p_*$, and these constants
are determined depending on what geometry we have defined for our problem. For
example, consider the Euclidean proximal setup, i.e., $\|\cdot\| = \|\cdot\|_2$, $V(x,y) =
\frac{1}{2}\|x - y\|_2^2$, $D_{Q,V} = D_Q$. In this case we have $p_* = 1$ and bound (136) for the number
of (127) oracle calls reduces to

$$O\left(\sqrt{\frac{LD_Q^2}{\varepsilon}} + \frac{D_Q^2nM^2}{\varepsilon^2}\right)$$

and the number of $\nabla g(x)$ computations remains the same. It means that our result
gives the same number of first-order oracle calls as in the original Gradient Sliding
algorithm, while the number of the biased stochastic zeroth-order oracle calls is
$n$ times larger in the leading term than in the analogous bound from the original
first-order method. In the Euclidean case our bounds reflect the classical dimension
dependence for the derivative-free optimization (see [92]).

The situation changes if we work on the probability simplex in $\mathbb{R}^n$ and the proximal
setup is entropic: $V(x,y)$ is the Kullback-Leibler divergence, i.e., $V(x,y) = \sum_{i=1}^n x_i\ln\frac{x_i}{y_i}$.
In this situation we have $D_{Q,V} = \sqrt{2\log n}$, $D_Q = 2$, $p_*^2 = O(\log(n)/n)$ [56]. Then the
number of $\nabla g(x)$ calculations is bounded by $O\left(\sqrt{(L\log n)/\varepsilon}\right)$. As for the number
of $\tilde f'_r(x,\xi,e)$ computations, we get the following bound:

$$O\left(\sqrt{\frac{L\log n}{\varepsilon}} + \frac{M^2\log^2 n}{\varepsilon^2}\right). \tag{137}$$

4.1.2 Strongly Convex Case


In this section we additionally assume that $g$ is $\mu$-strongly convex w.r.t. the Bregman
divergence $V(x,y)$ [151], i.e., for all $x, y \in Q$

$$g(x) \ge g(y) + \langle\nabla g(y), x - y\rangle + \mu V(x,y).$$

The authors of [15] use the restarts technique and get Algorithm 27.


Algorithm 27 The multi-phase zeroth-order sliding algorithm (M-zoSA)

Input: Initial point $y_0 \in Q$ and iteration limit $N_0$, initial estimate $\rho_0$ (s.t. $\Psi(y_0) - \Psi(y^*) \le \rho_0$)
for $i = 1, 2, \dots, I$ do
    Run zoSA with $x_0 = y_{i-1}$, $N = N_0$, $\{p_t\}$ and $\{\theta_t\}$ in (133), $\{\beta_k\}$, $\{\gamma_k\}$ and $\{T_k\}$ in (134)
    with $\tilde D = \rho_0/(\mu 2^i)$, and $y_i$ is the output.
end for
Output: $y_I$.


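The following theorem states the main complexity results for M-zoSA.

Theorem 17 For M-zoSA with $N_0 = 2\lceil\sqrt{5L/\mu}\rceil$ we have

$$\mathbb{E}[\Psi(y_i) - \Psi(y^*)] \le \frac{\rho_0}{2^i} + \frac{2n\Delta D_Q p_*}{r}.$$

Using this we derive the complexity bounds for M-zoSA.

Corollary 7 For all $N \ge 1$ the iterates of M-zoSA satisfy

$$\mathbb{E}[\Psi_0(y_i) - \Psi_0(y^*)] \le 2rM + \frac{\rho_0}{2^i} + \frac{2n\Delta D_Q p_*}{r}. \tag{138}$$

From (138) it follows that if

$$r = \Theta\left(\frac{\varepsilon}{M}\right), \quad \Delta = O\left(\frac{\varepsilon^2}{nMD_Q\min\{p_*,1\}}\right)$$

and $\varepsilon = O(\sqrt{n}MD_Q)$, then the number of evaluations for $\nabla g$ and $\tilde f$, respectively,
required by Algorithm 27 to find an $\varepsilon$-solution of (126) can be bounded by

$$O\left(\sqrt{\frac{L}{\mu}}\log_2\max\left[1, \rho_0/\varepsilon\right]\right), \quad O\left(\sqrt{\frac{L}{\mu}}\log_2\max\left[1, \rho_0/\varepsilon\right] + \frac{p_*^2nM^2}{\mu\varepsilon}\right).$$
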
4.1.3 From Composite Optimization to Decentralized Distributed Optimization


Finally, we get an estimate for solving the decentralized optimization problem. With
the help of (124) and (125), we reduce the original decentralized problem to the
penalized problem. Next, we need to define parameters of $f$ using parameters of the
local functions $f_i$. Assume that for each $f_i$ we have $\|\nabla f_i(x_i)\|_2 \le M$ for all
$x_i \in Q$, all $f_i$ are convex functions, the starting point is $\mathbf{x}_0 = (x_0^\top,\dots,x_0^\top)^\top$
and $\mathbf{x}^* = (x_*^\top,\dots,x_*^\top)^\top$ is the optimality point for (124). Then, one can show
that $\|\nabla f(\mathbf{x})\|_2 \le M/\sqrt{m}$ on the set of such $\mathbf{x}$ that $x_1,\dots,x_m \in Q$, $D_{Q^m}^2 = mD_Q^2$,
$D_{Q^m,V}^2 = mD_{Q,V}^2$ and $R_y$ from (125) satisfies $R_y^2 \le M^2/(m\lambda^+_{\min}(W))$. And we have the
following estimates in the Euclidean case:

$$O\left(\sqrt{\frac{\chi(W)M^2D_Q^2}{\varepsilon^2}}\right) \quad \text{communication rounds and}$$

$$O\left(\sqrt{\frac{\chi(W)M^2D_Q^2}{\varepsilon^2}} + \frac{nD_Q^2M^2}{\varepsilon^2}\right) \quad \text{calculations of } \tilde f(x,\xi) \text{ per node.}$$

At the same time, when we work on a simplex and use the Kullback-Leibler
divergence, we get estimates similar to (137):

$$O\left(\sqrt{\frac{\chi(W)M^2\log n}{\varepsilon^2}}\right) \quad \text{communication rounds and}$$

$$O\left(\sqrt{\frac{\chi(W)M^2\log n}{\varepsilon^2}} + \frac{M^2\log^2 n}{\varepsilon^2}\right) \quad \text{calculations of } \tilde f(x,\xi) \text{ per node.}$$

The bound for the communication rounds matches the lower bound from [136,
137] and one can note that under the above assumptions the obtained bound for
zeroth-order oracle calculations per node is optimal up to polylogarithmic factors in the
class of methods with optimal number of communication rounds (see also [32, 51]).
In particular, in the Euclidean case, we lose a factor of $n$ (which corresponds to the case
when the gradient is restored in the way (128)), and in the case of a simplex,
only a factor of $\log n$.
On Training Set Selection in Spatial Deep Learning


Eligius M. T. Hendrix, Mercedes Paoletti, and Juan Mario Haut 


Abstract The careful design of experiments in spatial statistics aims at estimating 
models in an accurate way. In the field of spatial deep learning to classify spatial 
observations, the training set used to calibrate a model or network is usually 
determined in a random way in order to obtain a representative sample. This chapter 
will sketch with examples that this is not necessarily the best way to proceed. 
Moreover, as in some cases windows are used to smooth signals, overlap may occur 
in the spatial data. On the one hand, this implies auto-correlation in the training set 
and, on the other hand, a correlation among pixels used for training and for testing. 
Our question is how to measure such an overlap and how to steer the selection 
of training sets. We describe an optimization problem to model and minimize the 
auto-correlation. A simple example is used to capture the concepts of design of 
experiments versus training set selection and the measurement of the overlap. 


1 Introduction 


Design of experiments has a long tradition in spatial statistics, [9], and can also be 
found in studies for training neural networks, [4]. The idea is to sample in those 
areas, such that given a model which requires parameter estimates, one intends to 
have estimates as accurate as possible. The training of classification models from 
hyperspectral data does not follow the tradition to select measurement points in a 
deterministic way following certain criteria. Rather the tradition is to sample from a 
set of pixel observations in a random way the so-called training set. To be precise, 
the idea is to use stratified sampling such that a representation is available of all 
classes, see [8]. We will sketch with a simple example that this is not necessarily the
best way to proceed. 
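For reference, the stratified random sampling described above can be sketched as follows. This is a generic illustration, not a procedure from the cited works; the function name, the rounding rule, and the guarantee of at least one pixel per class are assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(labels, fraction, rng):
    """Pick roughly `fraction` of the pixel indices from every class at random."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    chosen = []
    for label, indices in indices_by_class.items():
        # Keep at least one pixel per class so every class is represented.
        count = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, count))
    return sorted(chosen)

# Hypothetical image with 35 pixels in three classes of unequal size.
labels = [0] * 10 + [1] * 20 + [2] * 5
training_set = stratified_sample(labels, fraction=0.2, rng=random.Random(0))
```

The selection ignores where the pixels are located, so nearby pixels may well end up in the same training set.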

One of the challenges in the pixel selection is due to the use of frames of a 
certain window size in order to average out noise. As such, the idea to include and 
average signals of neighbour pixels is a good idea for noise reduction. However, 
this leads to overlap of pixel information used among the training set, when the 
selected pixels are near. This phenomenon of auto-correlation among experiments 
was already mentioned in [3]. Moreover, it reports that the overlap with the test set 
used for measuring the quality of the classifier provides an underestimation of the 
error, as partly the same information is used. 

More recently, we observe a discussion in the community that mentions the 
problem but mainly comes with heuristic sample procedures that still select pixels 
in a random way according to criteria and then applies suggested procedures on the 
standard benchmarks in literature, see [5, 7, 8]. It appears that no lesson is learned 
from the strategy followed by statisticians to create criteria that relate the quality 
of the estimated parameters to the design of experiments by using a deterministic 
selection strategy. 

From a complexity point of view, the latter is not attractive either. Selecting
$P$ measurements out of a $K$ pixel image is theoretically not exponential, but
polynomial. However, we are dealing with a bi-criterion problem, where, on the
one hand, we would like to have a representative set by selecting proportionally from
each class and, on the other hand, we would like to minimize the overlap within the
training set, as this is related to the dependence within the measurements.

Our objective is to illustrate that stratified random sampling does not necessarily 
provide best parameter estimates and at least to introduce a measure for the overlap 
appropriate for the selection of training sets. We analyse the computational effort 
to obtain best training sets according to different criteria and develop selection 
heuristics to come to approaches which can be used in practice. 

To investigate this objective, we first sketch the challenges based on a simple 
numerical example in Sect. 2. From there we will formulate the training set selection 
problem as a bi-criterion MILP problem in Sect.3. Section 4 summarizes our 
findings. 


2 Selection Criteria Illustration 


Classification of observations uses data with many characteristics, such as more than
224 bands of the airborne visible/infrared imaging spectrometer (AVIRIS) images.
The potentially high dimensional input signal $z_k$, $k = 1,\dots,K$, which may contain
noise, is associated with a class $c_k$. The early models on machine learning via
nonlinear regression were based on sigmoidal functions like logistic regression. For
such models, we know since the early works on nonlinear regression like [1] that
estimated parameter values, and with that a model, are more accurate if we sample in
those areas that define the inflection points from one class to the other. How can we
translate these findings to nonlinear models based on neural nets? An early work is
due to [4], who studies a shallow network with two hidden nodes, an input $z_k$ and
output $y_k$. Based on the underlying theory of statistical design of experiments, she
comes to what is called an optimal design consisting of places on which to measure
and how many times.

Representing the neural network as a regression function $f(\theta, z)$, one follows
the statistical model

$$y = f(\theta, z) + \varepsilon, \qquad (1)$$

where $\varepsilon$ is noise and $\theta$ the parameter values to be estimated minimizing a traditional
Root Mean Square Error (RMSE) criterion

$$\mathrm{RMSE}(\theta) := \sum_k (y_k - c_k)^2 = \sum_k \left( f(\theta, z_k) - c_k \right)^2 \qquad (2)$$

based on observations $(z_k, c_k)$. An example of such spatial observations is depicted
in Fig. 1. On the left side are the observations with noise and on the right side the
classes the pixels belong to.

The corresponding least square error concept has been extended to training of
neural networks using a gradient step idea backward into the network to adapt
parameter values. Training of a network usually refers to such gradient based
steps. The RMSE concept and the corresponding design of experiments are based
on model (1), which assumes we have noise on the output $c_k$ of the observations.
However, the situation of classifying practical earth observations we are interested
in is characterized by noise on the input signal and not on the class a pixel is assigned
to. Perhaps for this situation, the RMSE criterion for estimating the parameter
values and the associated design of experiments enhancing the pixel selection are
less adequate. Our aim is to sketch this phenomenon with a small example.

Fig. 1 Measured spatial (one-dimensional) data $z_k = d_k + \varepsilon_k$, with $\varepsilon_k$ a realization of
$\varepsilon \sim N(0, 25)$, on the left and classes $c_k$ on the right for an artificial 34 x 38 pixel instance


2.1 Training as Parameter Estimation 


We consider the following model where $z_k = d_k + \varepsilon_k$ and $(z_k, c_k)$ are observations
from ground truth data $(d_k, c_k)$ defined by the classifier

$$c_k = 1 + \mathrm{round}\left( \sum_{i=1}^{|C|-1} \frac{1}{1 + e^{b_i - d_k}} \right), \qquad (3)$$

where $C$ is the set of classes. The idea of training is to obtain values for the
parameter vector $\theta = (w_1, w_2, \ldots, w_{|C|-1}, b_1, b_2, \ldots, b_{|C|-1})$ for a shallow neural
net classification model

$$y = 1 + \sum_{i=1}^{|C|-1} \frac{1}{1 + e^{b_i - w_i z}}, \qquad (4)$$

feeding it with observations of a training set $(z_k, c_k) \in S$. An example is given for
4 classes in Fig. 2. We now consider the parameter estimation as the minimization
of RMSE($\theta$) for this model

$$\min \sum_{k \in S} \left( y(b, w, z_k) - c_k \right)^2, \qquad (5)$$


where $y(b, w, z_k)$ is the model outcome of (4) given parameter values $w$, $b$ and
observation $z_k$. A recent overview of many characteristics of the optimization
problem in supervised deep learning is given in [10]. We will limit ourselves to
several remarks with respect to (5).

Fig. 2 Shallow classification neural network with 6 parameters
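
To make model (4) and objective (5) concrete, the following minimal sketch evaluates both in plain Python. The toy training set and both bias vectors are hypothetical values chosen for illustration and are not data from the chapter.

```python
import math

def model_output(z, w, b):
    """Model (4): y = 1 + sum_i 1 / (1 + exp(b_i - w_i * z))."""
    return 1.0 + sum(1.0 / (1.0 + math.exp(bi - wi * z)) for wi, bi in zip(w, b))

def squared_error(w, b, sample):
    """Objective (5): sum of (y(b, w, z_k) - c_k)^2 over the training set."""
    return sum((model_output(z, w, b) - c) ** 2 for z, c in sample)

# Hypothetical toy training set (z_k, c_k) with 4 classes; not data from the chapter.
sample = [(10.0, 1), (60.0, 1), (120.0, 2), (180.0, 2), (250.0, 3), (400.0, 4)]
w = [1.0, 1.0, 1.0]
b_close = [84.5, 204.5, 338.5]  # illustrative biases lying between the class signals
b_far = [150.0, 160.0, 170.0]   # illustrative biases bunched in the wrong place

print(squared_error(w, b_close, sample))
print(squared_error(w, b_far, sample))
```

Biases that sit between the signal ranges of neighbouring classes give an error close to zero, whereas misplaced biases leave several observations a full class away from the model output.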


Symmetry As found by many researchers, e.g. [4, 6], any parametrization of model
(4) has $(|C| - 1)!$ equivalent parametrizations, as the nodes in the hidden layer
can be interchanged. In statistics this phenomenon is discussed under the name
identifiability, i.e. the optimal parametrization is not unique if the parameters are
non-identifiable. We should consider here that we are talking about a finite number
of alternatives. From an optimization point of view, one can add so-called symmetry
breaking constraints, i.e. force, for instance, $b_1 < b_2 < b_3 < \ldots$.
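
The symmetry can be checked numerically. The sketch below, with hypothetical parameter values for three hidden nodes, evaluates model (4) for every permutation of the nodes and then counts how many permutations survive an ordering constraint on the biases.

```python
import math
from itertools import permutations

def model_output(z, w, b):
    """Model (4): y = 1 + sum_i 1 / (1 + exp(b_i - w_i * z))."""
    return 1.0 + sum(1.0 / (1.0 + math.exp(bi - wi * z)) for wi, bi in zip(w, b))

# Hypothetical parameter values for |C| = 4 classes, i.e. 3 hidden nodes.
w = [0.5, 1.0, 2.0]
b = [40.0, 200.0, 650.0]
z = 210.0

# Interchanging hidden nodes means permuting the (w_i, b_i) pairs together.
outputs = [
    model_output(z, [w[i] for i in perm], [b[i] for i in perm])
    for perm in permutations(range(len(w)))
]
print(len(outputs), max(outputs) - min(outputs))

# A symmetry breaking constraint b_1 < b_2 < b_3 keeps exactly one ordering.
ordered = [
    perm for perm in permutations(range(len(b)))
    if all(b[i] < b[j] for i, j in zip(perm, perm[1:]))
]
print(len(ordered))
```

All $(|C| - 1)! = 6$ permutations give the same output up to rounding, and only one of them satisfies the ordering.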


Over-Parametrization This qualification is often used in the question of how
many nodes to add to a deep network. It is related to the identifiability issue with an
infinite number of alternatives. Consider our model with an additional parameter $v_i$
for the activation function in each node, such that

$$y = 1 + \sum_{i=1}^{|C|-1} \frac{1}{1 + v_i e^{b_i - w_i z}}. \qquad (6)$$

For node parameter values $(v_i, b_i, w_i)$ an equivalent representation is given by
$(1, b_i + \ln(v_i), w_i)$. This means the model has a parameter that can just as well be
left out. It may be clear that the identifiability analysis is not easy when using
deep neural networks with many levels and nodes. From the point of view of
deep learning, the idea of an artificial neural network is that part of the brain or
neurons can take over the job of other parts, so indeed the lack of identifiability is a
characteristic of such models.
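
A short numerical check of this equivalence is sketched below. It assumes that the additional parameter enters as a positive factor $v_i$ in front of the exponential, which is the form consistent with the equivalent representation $(1, b_i + \ln(v_i), w_i)$; the parameter values are hypothetical.

```python
import math

def node(z, v, b, w):
    """One hidden node with an extra scale v > 0 in front of the exponential (assumed form)."""
    return 1.0 / (1.0 + v * math.exp(b - w * z))

v, b, w = 3.0, 84.5, 1.0  # hypothetical node parameters
pairs = []
for z in (70.0, 84.5, 100.0):
    original = node(z, v, b, w)
    reduced = node(z, 1.0, b + math.log(v), w)  # v absorbed into the bias
    pairs.append((original, reduced))
    print(z, original, reduced)
```

Because $v e^{b - wz} = e^{b + \ln v - wz}$, both parametrizations produce the same node output for every input.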


No Optimal Parameter Values Consider for a moment the observations $z_k = d_k$
without noise. The RMSE problem (5) seems well defined with a lower bound of
0, which would be attained by a complete fit. Notice that where the ground truth data, as
sketched by the blue line in Fig. 3, has classification values $c_k$ of observations with
an integer value, the neural network classifier function (4) provides fractional values
$y_k$, sketched by the black dots in Fig. 3. A perhaps surprising observation is that any
"optimal" parameter value for $(w_i, b_i)$ can be multiplied by a large positive constant,
which brings the model values $y_k$ closer to the integer values $c_k$. Practically this
means that RMSE problem (5) has no real optimum.
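
The effect of scaling can be reproduced with a few lines of Python. In this sketch the noise-free observations lie on a hypothetical grid and are labelled by sharp thresholds, and the parameters $(w_i, b_i)$ are multiplied by increasing constants. A numerically stable sigmoid is used to avoid overflow for large scale factors.

```python
import math

def sigmoid(t):
    """Numerically stable 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def model_output(z, w, b):
    # Same as model (4), since 1 / (1 + exp(b_i - w_i z)) = sigmoid(w_i z - b_i).
    return 1.0 + sum(sigmoid(wi * z - bi) for wi, bi in zip(w, b))

b_true = [84.5, 204.5, 338.5]
# Noise-free observations on a hypothetical grid, labelled by sharp thresholds.
sample = [(float(z), 1 + sum(z > bi for bi in b_true)) for z in range(0, 601, 5)]

errors = []
for scale in (1.0, 2.0, 4.0, 8.0):
    w = [scale] * len(b_true)
    b = [scale * bi for bi in b_true]
    errors.append(sum((model_output(z, w, b) - c) ** 2 for z, c in sample))
print(errors)
```

The error keeps shrinking as the scale grows but never reaches zero, so the infimum of (5) is not attained by any finite parameter vector.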

Due to the last observation, in the sequel we fix the arc weights $w_i = 1$
and focus on the estimation of the bias parameters $b_i$. We first introduce an example
to illustrate the estimation with a randomly generated training set.

Fig. 3 Classifier that classifies data on the x-axis towards the classes {1, 2, 3, 4}. In blue the
ground truth relation, in black the estimation $y$ according to the neural net with the correct
parameter values, green circles are samples from each class and in red several extreme class
samples $(z_k, c_k)$


2.2 Training Set Selection and Parameter Estimation 


The basic question is what would be a good training set, e.g. which pixels to
select from Fig. 1 in order to estimate the bias values $b_1, b_2, b_3$ in our base case.
Traditionally, selection of the training set to calibrate the classifier was done by
stratified random sampling. Let $n_c$ be the number of pixels in class $c$, where
$\sum_c n_c = K$. One can generate a set of $p = |S|$ random samples by selecting
$P_c \approx p\, n_c / K$ pixels randomly from class $c$. A sketch of random sample points $(z_k, c_k)$
is given by green circles in Fig. 3. For a reason that will become clear, the samples
are limited to pixels that are a bit off the boundary of the image in Fig. 1. For the
example case, it also means that the highest $z_k$ values are not sampled.


Example 1 In the illustrative example, the measured characteristic $d_k$ is one
dimensional and taken as $d_k = (x_k - \bar{x})^2 + (y_k - \bar{y})^2$, $k = 1, \ldots, 1292$, where
$x_k$ and $y_k$ are coordinates and $\bar{x}$ and $\bar{y}$ the corresponding mean values in the image.
Notice that we add noise to the observation $z_k = d_k + \varepsilon_k$, where the variance
has been chosen to correspond to the signal noise ratio of 50:1 of the airborne
visible/infrared imaging spectrometer (AVIRIS). The classification towards class $c_k$
is done by the ground truth classifier (3) with bias values $b = (84.5, 204.5, 338.5)$.
The data $z_k$ and $c_k$ are depicted in Fig. 1. The number of pixels in each class is
given by $n = (256, 348, 292, 88)$. A sample of about 10% is stratified towards
$P = (26, 38, 29, 9)$, in total $p = 102$ samples, from a candidate set with $L = 1020$
pixels which are further than 2 pixels from the boundary of the image. As said, a
stratified random sample set $S$ is depicted by green circles in Fig. 3. Moreover, they
are also depicted in the image in Fig. 5.
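
The construction of such an instance and of a stratified random sample can be sketched as follows. Several choices here are assumptions and not taken from the chapter: pixel coordinates run from 1 to 34 and 1 to 38, the noise has standard deviation 5, the classifier (3) is replaced by sharp thresholds with ties assigned to the higher class, and the class quotas are rounded with a largest remainder rule. The resulting class counts therefore need not coincide with the numbers reported in the example.

```python
import random

def make_image(nx=34, ny=38, bias=(84.5, 204.5, 338.5), sd=5.0, seed=0):
    """Artificial image: d = squared distance to the centre, z = d + noise."""
    rng = random.Random(seed)
    xm, ym = (nx + 1) / 2, (ny + 1) / 2
    pixels = []
    for x in range(1, nx + 1):
        for y in range(1, ny + 1):
            d = (x - xm) ** 2 + (y - ym) ** 2
            c = 1 + sum(d >= b for b in bias)  # sharp-threshold stand-in for (3)
            pixels.append({"x": x, "y": y, "z": d + rng.gauss(0.0, sd), "c": c})
    return pixels

def stratified_sample(candidates, p, seed=0):
    """Draw about p * n_c / total pixels at random from each class c."""
    rng = random.Random(seed)
    by_class = {}
    for pix in candidates:
        by_class.setdefault(pix["c"], []).append(pix)
    quota = {c: p * len(g) / len(candidates) for c, g in by_class.items()}
    alloc = {c: int(q) for c, q in quota.items()}
    # Largest remainder rule so that the class quotas add up to exactly p.
    leftover = p - sum(alloc.values())
    for c in sorted(quota, key=lambda c: quota[c] - alloc[c], reverse=True)[:leftover]:
        alloc[c] += 1
    sample = []
    for c, group in by_class.items():
        sample.extend(rng.sample(group, alloc[c]))
    return alloc, sample

nx, ny = 34, 38
pixels = make_image(nx, ny)
# Candidates keep a margin of 2 pixels to every border (30 x 34 pixels).
candidates = [pix for pix in pixels
              if 3 <= pix["x"] <= nx - 2 and 3 <= pix["y"] <= ny - 2]
alloc, sample = stratified_sample(candidates, p=102)
print(len(pixels), len(candidates), alloc)
```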


We can now illustrate the idea of estimating the bias values minimizing RMSE
given the stratified random sample $S$ of 102 pixels. For this we minimize while taking the
constraint $b_1 < b_2 < b_3$ into account, using a starting value according to quantiles
of $z_k$. Running the routine fmincon of matlab2015 results in an estimate $\hat{b}$ =
(80.9, 205.5.3, 333.4) and a correct classification of the test set (all pixels excluding
the sample) of 98%. Figure 4 provides an image of the (non)smoothness of the
RMSE objective function. When we take the constraints into account, we are not
dealing with a global optimization problem here, although the large variation in
derivative values usually makes it difficult for practical optimization algorithms to
reach the optimum.

Fig. 4 Loss function values over the sample set as function of bias $b_1$ in blue, $b_2$ in red and $b_3$ in
yellow, while the other bias values are taken from the starting value

As we discussed before, the RMSE criterion is actually not completely correct,
as we do not encounter noise on the classification, but on the input signal itself.
Specifically for the bias values, one can consider the minimum and maximum
signal values within each class. If there were no noise in the signal, one could
estimate the bias values from the minimum (and maximum) signal values of a
sample set following the extreme order statistics ideas as outlined in [2]. He showed
that actually 3 extreme ordered values are relevant. We illustrate here a heuristic
estimate based on the 3 extreme values in each class. This gives a complete sample
of 24 values $(z_k, c_k)$ as depicted by the red stars in Fig. 3. By taking the average of
the three minimum points $z_{(1)}$, $z_{(2)}$ and $z_{(3)}$ of class $c + 1$ and the three maximum points
of class $c$ (which all contain noise) for estimating $b_c$, we obtain an estimate of
$\hat{b} = (81.1, 206.7, 334.1)$, which is similar. The test set is larger now, as this heuristic
procedure only requires 18 points for the example, and the correct classification of
the test set is 99%.
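
The extreme value heuristic described above can be sketched in a few lines. The function name and the noise-free toy sample are hypothetical; the averaging rule (three largest signals of class $c$ together with the three smallest of class $c + 1$) follows the description in the text.

```python
def estimate_biases(sample, n_extreme=3):
    """Estimate b_c as the mean of the n largest z of class c and the n smallest z of class c + 1."""
    by_class = {}
    for z, c in sample:
        by_class.setdefault(c, []).append(z)
    classes = sorted(by_class)
    estimates = []
    for low, high in zip(classes, classes[1:]):
        top_of_low = sorted(by_class[low])[-n_extreme:]
        bottom_of_high = sorted(by_class[high])[:n_extreme]
        extremes = top_of_low + bottom_of_high
        estimates.append(sum(extremes) / len(extremes))
    return estimates

# Hypothetical noise-free sample: integer signals labelled by sharp thresholds.
b_true = [84.5, 204.5, 338.5]
sample = [(float(z), 1 + sum(z > b for b in b_true)) for z in range(0, 401)]
b_hat = estimate_biases(sample)
print(b_hat)  # [84.5, 204.5, 338.5]
```

Only six observations per class boundary enter the estimate, which is why the procedure needs 18 points for three bias values.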

From this procedure, we derive that, in any selection, it is good to try to obtain a
large variation from the mean class value of the input signal of each class in a node.
For the sequel, we are going to measure this as an in-class variation criterion

$$V(S) = \sum_{c \in C} \frac{\sum_{\ell \in S,\, c_\ell = c} (z_\ell - \mu_c)^2}{\mu_c P_c}, \qquad (7)$$

where $\mu_c$ is the average of the signal values $z_k$ in class $c$.


2.3 Auto-Correlation due to Overlap


In the field of hyperspectral data, the classification is potentially hindered by noise.
Therefore, pixel data $z_k$ are often gathered from aggregation of observations in
a window around pixel $k$. For ease of analysis we consider all pixels within
a window of size $w$ to belong to the window, i.e. all pixels $\ell$ with
$\max\{|x_\ell - x_k|, |y_\ell - y_k|\} \le \frac{w-1}{2}$ are assumed to be used for the aggregation in the window.
The traditional (stratified) random sampling, with the argument that this provides
an independent and representative set of observations, has been commented on a lot
recently with respect to the independence concept. As mentioned, [3] were among
the first to point at the difficulty of having independent observations if windows
overlap. Therefore, [5, 7, 8] also discussed this point and came up with alternative
heuristic sampling techniques.

Overlap between the test set and the training set can be avoided spatially by splitting
the image into two parts, which both should be representative with respect to the
input signals. We will also locate the training set candidates at a distance of $\frac{w-1}{2}$
from the border of the image, such that the complete selected windows fall into the
training image.

Our argumentation is to use a sample of about 10% of the candidate pixels which
is representative in terms of criterion (7), but where the measured window data overlap
as little as possible. Rather than proposing new sampling ideas, our question deals
with quantifying the exact overlap of a selection $S$. To do so, we first introduce the
Boolean cover matrix

$$C_{k\ell} = \begin{cases} 1 & \text{if } \max(|x_k - x_\ell|, |y_k - y_\ell|) \le \frac{w-1}{2}, \\ 0 & \text{otherwise,} \end{cases} \qquad (8)$$
which specifies that pixel $k$ is covered by the window around pixel $\ell$. For each pixel
$k$ we define the overlap as the number of other selected pixel windows it is covered
by:

$$q_k = \left( \sum_{\ell \in S} C_{k\ell}\, \delta_\ell - 1 \right)^{+},$$

where $x^+ := \max\{x, 0\}$ and $\delta_\ell$ is the selection variable, i.e. $\delta_\ell = 1$ if pixel $\ell \in S$ and
0 otherwise. As the criterion of the overlap of a selection $S$ with a window size of
$w$ we define the criterion


O(S) = Ee (9) 


where $p := |S|$. One can read the formula as the average number of times that
a pixel is used more than once. Implicitly, such a criterion minimizes the number of
not seen pixels in the training part of the image.
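
The cover counts and the per-pixel overlaps $q_k$ can be computed directly from the window definition without storing the matrix $C$. The sketch below uses a hypothetical selection of three pixels on a 34 x 38 grid with $w = 5$; it reports the total overlap and the number of not seen pixels, and leaves any normalization of the criterion to the caller.

```python
def cover_and_overlap(selected, nx, ny, w):
    """Return cover counts per pixel and overlaps q_k = max(count - 1, 0)."""
    half = (w - 1) // 2
    cover = {(x, y): 0 for x in range(1, nx + 1) for y in range(1, ny + 1)}
    for sx, sy in selected:
        # The window around (sx, sy) covers all pixels within Chebyshev distance half.
        for x in range(sx - half, sx + half + 1):
            for y in range(sy - half, sy + half + 1):
                if (x, y) in cover:
                    cover[(x, y)] += 1
    q = {pix: max(count - 1, 0) for pix, count in cover.items()}
    return cover, q

# Hypothetical selection: two overlapping windows and one isolated window.
selected = [(10, 10), (12, 10), (30, 30)]
cover, q = cover_and_overlap(selected, nx=34, ny=38, w=5)

used_signals = sum(cover.values())          # p * w^2 when all windows fit in the image
total_overlap = sum(q.values())
not_seen = sum(1 for count in cover.values() if count == 0)
print(used_signals, total_overlap, not_seen)  # 75 15 1232
```

The windows around (10, 10) and (12, 10) share a strip of 3 x 5 pixels, which is exactly the total overlap of 15 reported by the sketch.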


Example 2 For our example, we use a window size of $w = 5$ and construct the
matrix $C$ for the $K = 34 \times 38$ pixel image, of which $L = 30 \times 34$ candidates are at
the adequate distance from the border. Figure 5 depicts the sample intensity of one
random sample in grey scale. More grey exhibits more overlap. In white, the pixels
are given that are not used (seen). Notice that for any selection of $p = 102$ samples,
2550 pixel signals are used, of which many are repeated due to the overlap.

Fig. 5 Random sample of 102 pixels from candidates at a window distance of the border and the
graphical sample intensity in grey. White pixels are not used by selected windows and all pixels
with more grey intensity than the lightest have overlap

Fig. 6 Total in-class variation and average overlap for 30 samples randomly generated in a
stratified way in black. In red, several deterministically selected selections, among which a highest
variation selection

We generated 30 random samples and measured the average in-class variation
V(S) as criterion of the accuracy of the sample and the average overlap according
to O(S) of each sample. Figure 6 tells us that with a random sample, of the 25 pixels
in a window, a pixel is used on average 1.15 times more than once by
the selection.

We also ordered the contribution to the in-class variation in each class $c$ and
selected the $P_c$ highest variation pixels in each class. The result is indicated in Fig. 6
by the highest red square. The example illustrates that going for a high variation
selection not only improves the variation within the samples but may also lead to
less overlap than random sampling. On average, the measured overlap criterion is
1.13.


Given the two criteria, we now formulate the selection of the training set as a
bi-criterion MILP model in Sect. 3.


3 Model 


Based on the developed criteria, we present a Mixed Integer Linear Programming 
(MILP) model to select the training set in an optimal way. We present the model in 
Sect. 3.1 and illustrate it in Sect. 3.2. 


3.1 Model Formulation


One of the challenges is to obtain an equivalent of stratified sampling, i.e. we want
to select $P_c$ samples from each class such that they are proportional to the occurrence
in the total image of $K$ pixels. For this, the total number of samples should be at least
$p \ge |C|$. Remember that $n_c$ is the number of pixels of class $c$. Basically, one
intends to stratify the sample in sub-samples of size $P_c$ such that $P_c / p \approx n_c / K$.


Indices
  $k$        Pixel index, $k = 1, \ldots, K$
  $\ell$     Candidate sample pixel index, $\ell \in \mathcal{L} \subseteq \{1, \ldots, K\}$
  $c$        Index for classes

Parameters
  $z_\ell$   Signal at pixel $\ell$
  $\mu_c$    Average of signal $z$ in class $c$
  $s_\ell$   Contribution of pixel $\ell$ to the in-class variation
  $w$        Window size
  $C$        $K \times |\mathcal{L}|$ matrix. Entry $C_{k\ell} = 1$ if the window of pixel $\ell$ covers pixel $k$
  $c_k$      Class pixel $k$ belongs to
  $P_c$      Number of pixels to be selected for the training set from class $c$

Variables
  $\delta_\ell \in \{0, 1\}$   Selection of pixel $\ell$ in training set $S$
  $q_k \ge 0$                  Amount of overlap in pixel $k$

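This means that the selection set $S$ has been stratified and is represented by a binary
variable vector $\delta$ with $p$ values of 1. The MILP bi-criterion optimization problem is

$$\begin{aligned}
& \max V, \ \min O \\
\text{s.t.}\quad & V \text{ and } O \text{ given by criteria (7) and (9) for } S = \{\ell \in \mathcal{L} : \delta_\ell = 1\}, \\
& q_k \ge \sum_{\ell \in \mathcal{L}} C_{k\ell}\, \delta_\ell - 1, \quad k = 1, \ldots, K, \\
& \sum_{\ell \in \mathcal{L}:\, c_\ell = c} \delta_\ell = P_c, \quad c \in C, \\
& \delta_\ell \in \{0, 1\},\ \ell \in \mathcal{L}, \qquad q_k \ge 0,\ k = 1, \ldots, K.
\end{aligned} \qquad (10)$$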


3.2 Model Illustration


For the maximization of the in-class variation $V$ without taking the overlap into
account, the pixels can be ordered in each class according to their contribution $s_\ell$
to the variation and one can select in each class the $P_c$ pixels that correspond to the
highest values. The configuration for the artificial example is depicted in Fig. 7 on
the left. One can observe that the pixels of interest, shown as green squares, are chosen at
the boundary of each class area.
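
The ordering procedure for the maximum variation selection can be sketched as follows. The candidate list is hypothetical, and the class mean is taken over the candidates, which is an assumption of this sketch. Only the squared deviation from the class mean is used for ranking, since a normalization that is constant within a class does not change the order.

```python
def max_variation_selection(candidates, alloc):
    """Select in each class c the alloc[c] candidates with the largest (z - mu_c)^2.

    candidates: list of (pixel_id, z, c); alloc: dict mapping class c to P_c.
    """
    by_class = {}
    for pixel_id, z, c in candidates:
        by_class.setdefault(c, []).append((pixel_id, z))
    selected = []
    for c, group in by_class.items():
        mu = sum(z for _, z in group) / len(group)  # class mean over the candidates
        ranked = sorted(group, key=lambda item: (item[1] - mu) ** 2, reverse=True)
        selected.extend(pixel_id for pixel_id, _ in ranked[:alloc[c]])
    return selected

# Hypothetical candidates (pixel id, signal z, class c).
candidates = [
    ("a", 1.0, 1), ("b", 2.0, 1), ("c", 3.0, 1), ("d", 4.0, 1), ("e", 10.0, 1),
    ("f", 20.0, 2), ("g", 21.0, 2), ("h", 25.0, 2),
]
selection = max_variation_selection(candidates, alloc={1: 2, 2: 1})
print(sorted(selection))  # ['a', 'e', 'h']
```

The selected pixels are those with signals furthest from their class mean, which in the spatial example corresponds to pixels near the boundary of each class area.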

The intensity of the grey scale provides the number of times data of a pixel is 
used in the selection. Lange et al. [7] mentioned for some case that a 10% training 
sample leaves only 1% of the pixels as not seen. This depends of course on the size 
of the window w. Our random samples left on average 8% of the pixels as not seen. 
The maximum variance selection leaves 7% as not seen. 

The other red square selections in Fig. 6 were generated by running a Mixed Integer
Linear Programming (MILP) code on the model, minimizing a combination of the
objectives, i.e. $-V + \beta O$, where $\beta$ is varied over $\beta = 0.005, 0.01, 0.02, 0.04, \infty$. It
illustrates that by giving up a bit of the variation, one can reduce the average overlap
for this example to about 0.98. Figure 7 shows on the right the configuration of
one of the minimum overlap selections. As one can observe, the selection goes for
the boundary of the map and leaves only 5 pixels as not seen. This illustrates the
tendency that overlap reduction leads to spreading more of the samples and thus
using more of the pixel data in the selected windows.
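
The weighted-sum idea can be illustrated on a tiny instance by plain enumeration instead of a MILP solver. Everything in this sketch is hypothetical: the 6 x 6 image, the signal, the class rule, the quotas, the unnormalized variation $V = \sum_\ell s_\ell \delta_\ell$ with $s_\ell = (z_\ell - \mu_c)^2$, and the overlap measure $O$ taken as total overlap divided by $p$. It is meant to show the trade-off, not to reproduce the reported numbers.

```python
from itertools import combinations, product

NX, NY, W = 6, 6, 3                 # hypothetical toy image and window size
HALF = (W - 1) // 2

def signal(x, y):
    return float(x * y)             # hypothetical noise-free signal

def pixel_class(x, y):
    return 1 if signal(x, y) <= 9 else 2

# Candidates keep a margin of HALF pixels to the border.
cands = [(x, y) for x in range(1 + HALF, NX - HALF + 1)
         for y in range(1 + HALF, NY - HALF + 1)]
by_class = {}
for pix in cands:
    by_class.setdefault(pixel_class(*pix), []).append(pix)
mu = {c: sum(signal(*p) for p in g) / len(g) for c, g in by_class.items()}
s = {p: (signal(*p) - mu[pixel_class(*p)]) ** 2 for p in cands}
alloc = {1: 2, 2: 2}                # hypothetical quotas P_c

def total_overlap(selected):
    cover = {}
    for sx, sy in selected:
        for x in range(sx - HALF, sx + HALF + 1):
            for y in range(sy - HALF, sy + HALF + 1):
                cover[(x, y)] = cover.get((x, y), 0) + 1
    return sum(count - 1 for count in cover.values())

def solve(beta):
    """Enumerate all stratified selections and minimize -V + beta * O."""
    groups = [list(combinations(by_class[c], alloc[c])) for c in sorted(by_class)]
    best = None
    for choice in product(*groups):
        selected = [p for group in choice for p in group]
        v = sum(s[p] for p in selected)
        o = total_overlap(selected) / len(selected)
        value = -v + beta * o
        if best is None or value < best[0]:
            best = (value, v, o, selected)
    return best[1:]

results = {beta: solve(beta) for beta in (0.0, 1.0, 10.0, 1000.0)}
for beta, (v, o, selected) in results.items():
    print(beta, v, o, sorted(selected))
```

With $\beta = 0$ the enumeration returns a maximum variation selection; as $\beta$ grows, variation is traded for less overlap, which mirrors the role of the large $\beta$ values in the text.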




Fig. 7 Selections. On the left the maximum variation selection and on the right a minimum overlap 
selection. Selected pixels with a green square. The grey scale indicates how many times data of the 
pixel is used by the selection. In white the not seen pixels 


4 Conclusions


Training set selection in deep learning is related to the concepts of design of
experiments in statistical models. The tradition in spatial deep learning models is
to use stratified samples for the training set selection. First of all, we illustrated
that this may ignore the concept of maximizing variation in order to obtain better
inflection point selection. Second, a profound discussion has been going on recently about
the auto-correlation aspect when using sample windows in order to reduce noise
in input signals. We presented an exact criterion to measure the overlap within a
training set. Moreover, our intention is to illustrate that a deterministic selection of
the training set by an optimization model may consciously reduce overlap and with
that the auto-correlation within the training data.


Surrogate-Based Reduced-Dimension
Global Optimization in Process Systems
Engineering


Kody Kazda and Xiang Li 


Abstract High dimensional global optimization problems arise frequently in process
systems engineering. This is a result of the complex mechanistic relationships
that describe process systems, and/or their large-scale nature. High dimensional
optimization problems can often be more easily solved by instead solving a
sequence of reduced-dimension subproblems. Surrogate models can allow the
formulation of reduced-dimension subproblems by approximating the key features
of the original model. Surrogate-based optimization (SBO) uses surrogate
modeling to solve a sequence of approximate reduced-dimension subproblems,
in order to converge to a high quality solution to the original high dimensional
problem. Here we review the key characteristics of SBO frameworks and their
application to process systems optimization problems.


1 Introduction 


Global optimization in process systems engineering is a computationally challenging
field because the types of problems encountered are often both non-convex
and high dimensional [1]. When non-convexity is present, numerical
methods that only possess local information, such as gradients and Hessians, 
cannot be expected to yield more than local solutions [2]. Therefore, we must 
employ procedures that iteratively tighten global bounds on the optimal value 
to identify regions of the search space that can be removed as candidates for 
possessing an optimal solution [2]. These global procedures become exponentially 
more computationally challenging as the dimension of the search space grows [3]. 
This is due to both the increased challenge of applying numerical methods to higher 
dimensional matrices, and the expected exponential increase in the number of areas 
of the search space to consider as candidates for possessing an optimal solution. 

The non-convexity and high dimensional nature of optimization problems in
process systems engineering are often unavoidable. The non-convexity arises from 
both mechanistic relationships describing the process system, and the discrete 
decisions that are often considered [1]. The nature of the high dimensionality is 
dependent on the application. Below we discuss several common process system 
optimization applications and how they result in non-convex and high dimensional 
optimization problems. 

Process scheduling applications are concerned with determining the optimal 
resource allocation sequence and durations to complete production tasks. Allocation 
decisions are naturally discrete, and the relationships governing the production tasks 
are often nonlinear, resulting in mixed-integer nonlinear programs (MINLP). To 
formulate scheduling problems, the time domain must be discretized into intervals 
of fixed or varying length, where allocation decisions are made for each interval [4]. 
The resulting discretization of decisions over the time domain often results in high 
dimensional problems. The dimension can grow further if uncertainty in the process 
parameters is considered in the optimization problem. Considering uncertainty in 
the process parameters can be important because often these parameters are not 
known exactly, and if modeled as such, the optimal solution found may be infeasible 
in practice when the actual parameter values are realized. This is often addressed 
by modeling scheduling problems as two-stage stochastic programs [5]. The
two-stage method typically models the scheduling problem such that resource allocation
decisions are made in the first stage without knowing the realization of the uncertain 
parameters. The second stage considers a scenario for each possible realization of 
uncertainty. In each scenario recourse decisions are made with the parameter value 
realized. This method determines a single set of first-stage decisions that are optimal 
in expectation over the parameter uncertainty, and multiple sets of second stage 
decisions to react optimally to each possible realization of uncertainty [6]. 
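
The two-stage structure can be illustrated with a deliberately small example that is solved by enumeration. The problem data are hypothetical: a first-stage capacity $x$ is reserved at unit cost before demand is known, and in each scenario any shortage is covered by a more expensive recourse action. This is a sketch of the modeling idea, not of a scheduling model from the literature.

```python
def expected_cost(x, scenarios, first_stage_cost, recourse_cost):
    """First-stage cost plus the expected cost of the optimal recourse in each scenario."""
    cost = first_stage_cost * x
    for prob, demand in scenarios:
        shortage = max(demand - x, 0)   # optimal recourse: cover exactly the shortage
        cost += prob * recourse_cost * shortage
    return cost

# Hypothetical scenarios (probability, demand) for the uncertain parameter.
scenarios = [(0.3, 2), (0.5, 5), (0.2, 8)]
first_stage_cost, recourse_cost = 1.0, 3.0

best_x = min(range(0, 11),
             key=lambda x: expected_cost(x, scenarios, first_stage_cost, recourse_cost))
best_cost = expected_cost(best_x, scenarios, first_stage_cost, recourse_cost)
recourse = [max(demand - best_x, 0) for _, demand in scenarios]
print(best_x, best_cost, recourse)
```

A single first-stage decision is returned together with one recourse decision per scenario, and the number of second-stage variables grows with the number of scenarios, which is the source of the dimension growth mentioned above.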

Physical network applications are concerned with transporting some goods 
(electricity, water, gas) from one location to another as efficiently as possible 
[7]. The relationships governing the transportation of the goods throughout each 
network arc (such as a pipeline) are often nonlinear, and some of the equipment 
operation decisions at network nodes (arc junctions) are often discrete (on/off), 
resulting in MINLPs. Practical networks can contain thousands of arcs and nodes, 
making the problem high dimensional. As in the case of scheduling problems, 
considering uncertainty in network parameters results in the dimension of the 
problem growing further. 

Process synthesis seeks to determine unit operations, connectivities, and oper- 
ating conditions to meet certain product requirements [8]. Traditional sequential 
methods use rules of thumb based on the natural hierarchy among engineering 
decisions during the generation of a process flow-sheet. These methods enable 
the sequential solution of smaller subproblems to arrive at the process flow-sheet 
[9]. By disregarding the full interaction between decisions at different stages,
the computational complexity of the process synthesis is reduced. In contrast, 
superstructure optimization-based methods model a superstructure of all potentially 
useful unit operations and interconnections, and use optimization methods to 
search through these structures and associated operating conditions for the most
cost effective process flow-sheet [9]. Superstructure methods have the potential to
identify process flow-sheets superior to those of traditional sequential methods, but are
limited by computational complexity. The complexity arises from the number of 
combinations of process structures, and the nonlinear physics governing the unit 
operations, resulting in high dimensional MINLPs [10]. 

Given the computational difficulty of solving non-convex and high dimensional 
optimization problems discussed previously, specialized solution methods must be 
used that solve sequences of reduced-dimension, and possibly convex, subproblems. 
By only searching a reduced-dimension search space, the challenges associated with 
searching high dimensional search spaces are resolved. With the reduced-dimension 
subproblems being solved many times, the reduced-dimension search space must be 
searched repeatedly. Overall, a significant reduction in computational expense can 
be achieved by repeatedly searching the reduced-dimension search space, versus the 
original high dimensional search space. 

In this paper, we review Surrogate-Based Optimization (SBO) methods for 
solving high dimensional process system optimization problems. SBO methods 
are one of several well-known methods for solving high dimensional optimization 
problems using reduced-dimension subproblems. In Sect. 2 we review the common 
model types found in process system optimization applications. In Sect. 3 we discuss 
how SBO methods are used to solve high dimensional optimization problems that 
are modeled using any of the common model types. 


2 Model Types in Process Systems Engineering 


We begin by describing the three common modeling approaches used in process 
system optimization. Once the common modeling approaches have been described, 
we can outline the methods of solving high dimensional global optimization 
problems that involve these types of model. 


2.1 White-Box Model 


When the process model is described by known mechanistic relationships that 
are derived from first-principles, it is referred to as a white-box model [11, 12]. 
An example of an expensive-to-evaluate white-box model might be a large set 
of differential algebraic equations. For instance, in the scheduling of a sequential 
batch process, it may be desirable to consider the operating trajectories of chemical 
reactions through a set of differential algebraic equations [13]. The mechanistic 
relationships describing white-box models often contain assumptions and simplifications
to compensate for the lack of understanding of all the physical phenomena
governing the process, or the inability to handle the full complexity of the resulting
problem [11].


2.2 Black-Box Model 


Black-box models can be used to overcome the limitations of white-box models. 
Black-box models often use process operating data, data from experiments, or 
proprietary codes to build the process model [14]. Depending on the source of the 
data, it can either describe a stochastic or deterministic input-output relationship. In 
any case, the mechanistic relationships describing the input-output model may not 
be known, but the input-output model can be described through statistical treatment 
of the dataset [11, 15]. Statistical tools such as artificial neural networks (ANN) 
[16], multivariate adaptive regression splines [17], and kriging [18] might be used to
form the empirical process model. The empirical models are typically fitted to the 
data according to a minimization criterion that is related to the difference between 
predictions and observations [11], or can be found using explicit functions of the 
dataset (e.g. kriging) [19]. While black-box models can be useful for overcoming 
the limitations of a white-box model when the process is not fully understood or the 
complexity is too large to handle, they too possess drawbacks. The main drawback 
is the need for process data to describe the system, making these models poor 
predictors in regions with insufficient data [11]. Elaborate and expensive sampling 
techniques may be required to build a black-box model, thereby making first- 
principles driven white-box models more attractive in certain cases. 


2.3 Gray-Box Model 


Gray-box models, or semi-physical models, propose to overcome the limitations 
of both white and black-box models, while maintaining their benefits. Gray-box
models form a process model in which some components are first-principles
driven white-box models, and some components are data driven black-box models. 
In doing so, first-principles knowledge about the system can be used when it is 
available. As a result, the black-box models are likely to require less process data, 
and to form better predictions outside regions where process data has been collected 
[11]. An example where gray-box modeling has been applied is the optimal design 
of energy systems. Energy systems often include expensive finite elements, a large 
number of partial differential equations, and high-fidelity mechanistic models. It is 
often the case that some of these components exist as proprietary simulations that
simply return an output for any input, therefore lacking derivative information [14]. 


3 Surrogate Modeling for Solving Reduced-Dimension
Subproblems


A surrogate model is one that approximates a true model, with the important 
properties of being inexpensive-to-evaluate in comparison, and possessing gradient 
information for every point in the input domain [20]. In the case where the true 
model is described by data, this definition is valid given that sampling new data can 
be an expensive and time consuming process. Surrogate-based optimization (SBO) 
is broadly defined as optimization techniques that make use of surrogate modeling 
techniques [21]. 

The most obvious application of surrogate-based optimization is when the 
process model is a black-box, given that an analytical expression is needed for 
local or global optimization. By fitting the surrogate model to the black-box data, 
an analytical expression describing the black-box model can be obtained. In this 
case, SBO might instead be referred to as black-box optimization. Given that some 
components of any gray-box model are black-box models, it follows that SBO is 
needed when the process model is a gray-box. In this case, SBO might instead be 
referred to as gray-box optimization. 

SBO can also be useful when the process model is an expensive-to-evaluate 
white-box model. In this case, although analytical expressions are available to 
describe the mechanistic equations governing the process, certain lower dimensional 
input-output relationships that would be less complex to describe the process system 
might not be described by the mechanistic relationships. When an expensive-to- 
evaluate component exists in the process model, it is often the case that many of 
the relationships considered within the component are not directly relevant to the 
process decisions being considered, despite their important role in the component. 
That is, the model component may describe an n input to m output relationship, 
while possessing additional intermediate variables and equations to describe this 
relationship. This is depicted in Fig. 1a. Figure 1b demonstrates how a surrogate 
model can be used to effectively reduce the dimension of the expensive-to-evaluate 
component. By employing a well-designed surrogate, the mechanistic relationships 
considered within the expensive-to-evaluate model component are disregarded, and 
an approximation of only the relationship directly relevant to the process decision 
remains. As a result, the high dimensional mechanistic model is replaced by a 
reduced-dimension surrogate model [22]. 

We can mathematically describe the reduced-dimension surrogate optimization 
subproblem as follows: 


\[
\begin{aligned}
\min_{x} \quad & f^*(x) \\
\text{s.t.} \quad & g^*_m(x) \le 0 \quad \forall m \in \{1, \dots, M\}, \\
& g_k(x) \le 0 \quad \forall k \in \{M+1, \dots, M+K\},
\end{aligned}
\tag{RD-SP}
\]

Fig. 1 (a) Demonstration of an expensive-to-evaluate model that has intermediate variables z and 
additional equations (in x, y, and z) to describe the relationship between input variables x and output variables y. 
(b) Demonstration of a surrogate model f*(x) that only considers the relationship between the 
input variables x and output variables y, thereby reducing the computational expense of evaluating 
the relationship 


where f*(x) is a surrogate model approximation of the original objective function 
f(x), M represents the number of constraints that are approximated by surrogate 
models g*(x), and K represents the number of constraints that are kept from the 
original problem. 

The key steps found in any SBO framework are: 


1. Selection of surrogate type (e.g. polynomials, kriging, ANN). 

2. Initial sampling of data from the process model (referred to as sampling plan [23] 
or design of experiments [21]). 

3. Surrogate model identification and validation. 

4. Optimization of RD-SP using the current surrogate models. 

5. Augmentation of existing sample data with new sample data that is selected 
according to an infill criterion (referred to as sample-point refinement [21] or 
adaptive sampling [23]). Then return to Step 3, or terminate the SBO procedure. 

We note that the well-known field of Bayesian optimization encompasses 
optimization procedures that are made up of some variation of Steps 1-5, applied 
to black-box models. We refer the reader to Shahriari et al. [24] for an extensive 
review of Bayesian optimization techniques. 

The most basic SBO framework does not include Step 5, thereby removing the 
iterative component of the procedure. The benefit of this basic framework is that the 
surrogate model only needs to be identified once. The optimal solution to RD-SP 
is an approximate solution to the original problem. The drawback of the basic 
framework is that, to provide a high likelihood that the optimal solution found through 
SBO is a high-quality solution to the original problem, the initial sample of data 
must be large enough to cover the entire search domain adequately. This 
can be problematic when the true process model is expensive-to-evaluate, or when the 
computational expense of identifying the surrogate model scales poorly with the 
number of data points. Given that one or both of these issues are likely to arise 
for typical applications of SBO, the most common SBO frameworks include the 
iterative component provided by Step 5. 

The iterative SBO framework begins with a surrogate model that provides a 
coarse description of the true process model. As the iterative procedure progresses, 
the surrogate model is refined only in the regions of the search domain that are likely 
to improve the quality of the surrogate solution [19]. The goal of this procedure is 
to be sample efficient, or in other words, to require as few samples as needed to 
determine high quality solutions from the SBO procedure. The computational cost 
of being sample efficient is the requirement of repeatedly solving RD-SP. For many 
SBO applications, the computational savings associated with being sample efficient 
outweigh the cost of repeatedly solving RD-SP, thereby making the iterative SBO 
framework attractive. 

In what follows we will provide further details of the important considerations for 
the steps in the SBO framework described through Steps 1-5. We end the discussion 
with the important considerations when selecting the type of surrogate model to use. 


3.1 Sampling Plan/Design of Experiments 


A sampling plan must be made to generate the initial samples that will be used to 
train the surrogate model. The goal of the sampling plan is to allocate a limited 
number of samples in the approximation domain such that the maximum amount of 
information is acquired [19]. The sampling budget will be determined according to 
how expensive-to-evaluate the samples are. In order to capture global information 
about the true model, sample allocation techniques typically aim to disperse samples 
across the sampling domain. Classic sampling-plan techniques such as factorial 
designs were developed for experiments with the goal of limiting the effect of 
random error [21]. Modern sampling plan methods were developed for sampling 
deterministic computer simulations. Perhaps the most popular modern sampling 
plan is Latin Hypercube Sampling, which aims to construct a sampling plan with 
a uniform, but not regular, spread of points across the approximation domain [25]. 
Given p sample points to be allocated, Latin Hypercube Sampling partitions each 
dimension into p bins, thereby yielding p^n bins for an n-dimensional input space. 
Samples are then randomly allocated to these bins, while ensuring that for each one- 
dimensional projection of the bins onto the input dimensions, exactly one sample is 
allocated to each bin [25]. For a more extensive review of sampling-plan techniques 
readers are referred to Queipo et al. [26]. 
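
The bin-allocation rule described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the procedure of [25]: the function name `latin_hypercube`, the unit hypercube as approximation domain, and the uniform placement of each point inside its bin are choices made here for the example.

```python
import random


def latin_hypercube(p, n, seed=None):
    """Return p points in the unit hypercube [0, 1)^n.

    Each dimension is split into p equal bins, and every bin of every
    one-dimensional projection receives exactly one sample.
    """
    rng = random.Random(seed)
    samples = [[0.0] * n for _ in range(p)]
    for j in range(n):
        bins = list(range(p))
        rng.shuffle(bins)  # random bin assignment for dimension j
        for i in range(p):
            # place point i uniformly at random inside its assigned bin
            samples[i][j] = (bins[i] + rng.random()) / p
    return samples


# Example: 5 sample points in a 2-dimensional input space
points = latin_hypercube(p=5, n=2, seed=0)
```

Points on a general box domain can be obtained by rescaling each coordinate from [0, 1) to the bounds of the corresponding input variable.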


3.2 Surrogate Model Validation 


If the surrogate model is being used to approximate white-box model components, 
it is possible to perform rigorous validation that bounds or exactly determines the 
approximation error possessed by the surrogate model. In these cases, it may be 
possible to use the solution of RD-SP to make global optimality guarantees for the 
original problem. Various rigorous validation methods have appeared for white-box 
surrogate models, all reliant on employing sub-optimization procedures. In Kazda 
and Li [27], a piecewise-linear surrogate model is validated by searching across each 
polytope for the maximum difference between the approximation and the analytical 
function being approximated. The resulting sub-optimization problems can be non- 
convex NLPs, but can often be solved to global optimality quickly because of their 
low dimension [27]. Other similar approaches have been proposed for validating 
piecewise-linear approximations, except that they seek only to bound the approximation 
error rather than determine it exactly. The advantage of these methods is that they avoid having to solve non-convex 
problems. This is achieved by searching over each polytope of the piecewise- 
linear approximation for the maximum difference between the approximation and a 
convex relaxation of the analytical function being approximated [28, 29]. 
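
As a one-dimensional illustration of what such a validation computes, consider an example assumed here for exposition (it is not taken from [27-29]): the function f(x) = x^2 approximated on a single segment [a, b] by the chord \ell(x) through its endpoints. In this simple case the maximum approximation error is available in closed form:

```latex
% Illustrative example (an assumption of this sketch, not from the cited works):
% f(x) = x^2 approximated on one segment [a, b] by the chord \ell(x).
\begin{align*}
\ell(x) &= f(a) + \frac{f(b) - f(a)}{b - a}\,(x - a) = (a + b)\,x - ab, \\
\ell(x) - f(x) &= (a + b)\,x - ab - x^2 = (x - a)(b - x), \\
\max_{x \in [a, b]} \bigl(\ell(x) - f(x)\bigr) &= \frac{(b - a)^2}{4}
\quad \text{at } x = \tfrac{a + b}{2}.
\end{align*}
```

With breakpoints spaced 0.5 apart the error on each segment is therefore at most 0.0625, and halving the spacing reduces it by a factor of four. For general functions no closed form is available, which is why the methods above rely on sub-optimization over each polytope.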

If the surrogate model is being used to approximate black-box model compo- 
nents, the surrogate model can only be validated through statistical estimation of 
the approximation error. Some surrogate model types, such as kriging, provide 
predictions as well as estimation of the prediction error distribution at each 
point. For surrogate types without this property, distinct methods of validating the 
prediction accuracy outside the training points exist. A common method is the split- 
sample method, whereby the initial sample data is split into training and test data. 
The training data is used for fitting the surrogate model, and the test data is used 
for computing an unbiased estimate of the generalization error, or the ability for 
the surrogate to generalize outside the training points [26]. The drawback of this 
approach is that the generalization error computed is heavily dependent on how the 
sample is split. Furthermore, when the sampling budget is particularly small owing 
to an expensive-to-evaluate process model, it may be desirable to take advantage of 
all sample data for training the surrogate model [26]. 

To overcome these limitations, the method of cross-validation is frequently 
employed. Cross-validation begins with randomly splitting the samples into k equally sized 
subsets. The surrogate model is then trained k times, each time omitting one of 
the k subsets from the training data, and using the omitted subset to compute the 
generalization error. The overall generalization error is then estimated by averaging 
the k generalization errors. If k equals the number of samples, then this method 
is referred to as leave-one-out cross-validation. Cross-validation is advantageous 
because it provides a nearly unbiased estimation of the generalization error, and the 
variance is reduced in comparison to the split-sample approach. The drawback is 
that the surrogate model must be fitted k times, which may be impractical in some 
applications [26]. 

Bootstrapping has similarities to both split-sampling and cross-validation, and 
has been shown to be the most effective validation method in some applications. The 
general idea of bootstrapping is to draw random subsamples from the sample 
data for fitting the surrogate model and use the remaining data to calculate the 
generalization error. This procedure is then repeated numerous times, each time 
replacing the subsample from the previous iteration. The overall generalization error 
is again computed as an average of the generalization errors computed for each 
subsample [26]. Each iteration is essentially a split-sample, and the repetitive nature 
affords the advantages of the cross-validation method. 
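
The k-fold cross-validation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the surrogate is taken to be a one-dimensional polynomial fitted with NumPy, the generalization error is measured as a mean squared error, and the function name `k_fold_error` and the test data are hypothetical.

```python
import numpy as np


def k_fold_error(x, y, k, degree=2, seed=0):
    """Estimate the generalization error (mean squared error) of a
    polynomial surrogate by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    # randomly split the sample indices into k (nearly) equal subsets
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i, test_idx in enumerate(folds):
        # train on every subset except the i-th one
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        # generalization error on the omitted subset
        residual = np.polyval(coeffs, x[test_idx]) - y[test_idx]
        errors.append(np.mean(residual ** 2))
    # overall estimate: average of the k generalization errors
    return float(np.mean(errors))


x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x ** 2  # stand-in for expensive model evaluations
cv_error = k_fold_error(x, y, k=5)
loo_error = k_fold_error(x, y, k=len(x))  # leave-one-out cross-validation
```

Setting k equal to the number of samples gives the leave-one-out variant, at the cost of one surrogate fit per sample.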


3.3 Sample-Point Refinement/Adaptive Sampling 


Once the surrogate model is built, it can be used for finding solutions that are 
likely to be high quality for the true process model. However, the optimal solution 
reported by SBO is always obtained by evaluating the true process model at the 
point suggested by the surrogate model, not by evaluating the surrogate 
approximation itself. This ensures that the objective value reported by the SBO 
procedure contains no approximation error. The additional evaluations of the true process 
model do not need to only be used to validate the surrogate solution, but can also be 
used to improve the surrogate model and repeat the search procedure. In this way, 
the surrogate model is able to guide the interrogation of the true process model 
to help extract samples that provide the most information about where the best 
solutions to the true process model exist [30]. The extraction of new samples based 
on the combination of the existing surrogate model and an infill criterion is known 
as sample-point refinement, or adaptive sampling, and is seen as the most important 
step to the success or failure of the SBO procedure [23]. As such, several approaches 
to developing the most effective infill criteria have been developed. 

Approaches to adaptive sampling can be classified as pure exploitation, pure 
exploration, or a balance of exploitation and exploration. Pure exploitation 
approaches are based on the assumption that the surrogate model describes the 
true model well over the entire input domain. In this approach, the surrogate model 
is optimized and the optimal solution is evaluated on the true process model to 
acquire an adaptive sample. If the adaptive sample violates a predefined error 
tolerance, the surrogate model is retrained with the adaptive sample included, and 
the procedure is repeated, otherwise the procedure terminates. The assumption that 
the surrogate is accurate over the entire input domain is unlikely to be valid, and so 
this method of pure exploitation is likely to quickly converge to a local optimum. 
The location of the local optimum found is highly dependent on the initial samples 
selected to train the surrogate model [23]. 
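
The pure exploitation loop described above can be sketched as follows. This is a minimal illustration, not a method from [23]: the function `true_model` is a hypothetical stand-in for an expensive process model, the surrogate is assumed to be a least-squares polynomial, the surrogate is optimized by a simple grid search, and an iteration cap `max_iter` is added so the loop always terminates.

```python
import numpy as np


def true_model(x):
    # Hypothetical stand-in for an expensive-to-evaluate process model.
    return np.exp(x) - 2.0 * x


def pure_exploitation(x_init, lower, upper, tol=1e-3, max_iter=20, degree=2):
    x_s = np.array(x_init, dtype=float)
    y_s = true_model(x_s)                      # initial samples
    grid = np.linspace(lower, upper, 2001)     # candidate points for the search
    for _ in range(max_iter):
        coeffs = np.polyfit(x_s, y_s, degree)  # (re)train the surrogate
        pred = np.polyval(coeffs, grid)
        x_new = grid[np.argmin(pred)]          # optimize the surrogate
        y_new = true_model(x_new)              # adaptive sample on the true model
        if abs(y_new - pred.min()) <= tol:     # surrogate accurate at its optimum
            break
        x_s = np.append(x_s, x_new)            # otherwise add the sample and repeat
        y_s = np.append(y_s, y_new)
    # report the true model value at the last suggested point
    return float(x_new), float(y_new)


x_opt, f_opt = pure_exploitation([0.0, 1.0, 2.0], lower=0.0, upper=2.0)
```

Note that the returned objective value is always a true model evaluation, never a surrogate prediction, and that the point reached depends on the initial samples passed in `x_init`.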

Pure exploration approaches seek to use adaptive samples to fill in the areas 
of the search space between existing samples. This can be done with sequential 
space-filling sampling plans that do not make use of the surrogate model. More 
common approaches attempt to reduce prediction error of the surrogate model by 
using an estimate of the prediction error to select adaptive samples. Pure exploration 
approaches are often not ideal for SBO as valuable sampling budget will be 
consumed on areas of the search space that have no merit based on solution quality. 
However, pure exploration can be useful for gaining an understanding of the true 
process model [23]. 

SBO approaches that seek to determine globally optimal solutions need to 
balance the exploitation of areas of the search space that are believed to possess 
high quality solutions, and exploration of the search space in hopes of discovering 
areas that possess high quality solutions. Exploitation without sufficient exploration 
will likely leave areas of the search space with high quality solutions undiscovered. 
Exploration without sufficient exploitation will result in SBO procedures with 
unacceptably high computational expense. The simplest method of balancing 
exploitation and exploration is to use the surrogate model f*(x) for exploitation, 
the estimated prediction variance s^2(x) for exploration (through its square root s(x)), and a constant parameter A 
to balance the two [23]: 


\[
\min_{x} \; LB(x) = f^*(x) - A\,s(x) \tag{SBO-LB}
\]


If some of the constraints in RD-SP are surrogate models, then relaxing these 
constraints using a similar estimated prediction variance term can be considered. 

If A is chosen to be closer to zero, solving SBO-LB will be more biased 
toward exploitation, and if A is chosen to be larger, it will be more biased toward 
exploration. One drawback of this approach is that it is not obvious how to 
best choose A for any given application. Other more sophisticated approaches to 
balancing exploitation and exploration can be found in Forrester and Keane [23] 
and Shahriari et al. [24]. 
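
The effect of the parameter A in SBO-LB can be illustrated with a small sketch. The assumptions here are ours, not the source's: the surrogate is a zero-mean Gaussian process with a squared-exponential kernel and unit prior variance, the kernel length scale is fixed rather than fitted, SBO-LB is minimized over a finite set of candidate points, and all function names and data are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length=0.3):
    # squared-exponential kernel with unit prior variance (assumed)
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)


def gp_predict(x_train, y_train, x_new, length=0.3, nugget=1e-8):
    """Prediction f*(x) and standard deviation s(x) of a zero-mean GP."""
    K = rbf_kernel(x_train, x_train, length) + nugget * np.eye(len(x_train))
    k_star = rbf_kernel(x_train, x_new, length)
    mean = k_star.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def next_sample(x_train, y_train, candidates, A):
    # minimize LB(x) = f*(x) - A s(x) over the candidate points
    mean, std = gp_predict(x_train, y_train, candidates)
    return candidates[np.argmin(mean - A * std)]


x_train = np.array([0.0, 0.4, 1.0, 2.0])
y_train = np.sin(3.0 * x_train)  # hypothetical true-model evaluations
candidates = np.linspace(0.0, 2.0, 201)
x_exploit = next_sample(x_train, y_train, candidates, A=0.0)   # pure exploitation
x_balanced = next_sample(x_train, y_train, candidates, A=2.0)  # adds exploration
```

With A = 0 the selected point is simply the minimizer of the surrogate prediction. For any A > 0 the selected point has an estimated standard deviation at least as large as that of the A = 0 choice, which is how larger values of A push the search toward less-sampled regions.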


3.4 Selection of Surrogate Type 


When approximating a process model component with a surrogate model, careful 
attention must be given to what surrogate type is chosen. Common surrogate types 
found in process engineering applications are polynomials, kriging, ANNs, and 
radial basis functions [31, 32]. The surrogate model will represent the original 
model component for the purpose of optimization. Therefore, the surrogate must 
be able to represent the original model component with a sufficient level of 
accuracy, and it must also be convenient for optimization [30]. There is no clear 
consensus in the literature on which surrogate form is the most appropriate for which 
applications. Despite this, there are several key characteristics of the original model 
that can inform which surrogate form will likely be a good choice. In general, one 
should consider the number of inputs of the original model, whether the output is 
deterministic or stochastic, whether the output is continuous or discrete, and the 
difficulty of sampling the original model [33]. 

The parameters of a surrogate model are computed based on the data that is 
sampled from the true process model component. The parameters might simply be 
a function of the sampled data, or they might be determined through optimization 
by minimizing some fitness metric over the sampled data. The number of inputs to 
the true process model component is important to consider because it will influence 
the number of parameters that need to be determined, and the number of samples 
needed to sufficiently cover the approximation domain. Both factors will influence 
the computational expense of determining the surrogate parameters. If the number 
of inputs is large enough, then certain surrogate forms that are desirable in low 
dimensions are not practical to apply. For instance, kriging models have been one 
of the most popular surrogate forms in process systems engineering over the past 
several decades. Their popularity is attributed to their ability to represent a large 
range of functions, while requiring relatively few parameters [32]. Additionally, 
as a Gaussian process model, kriging surrogates quantify prediction uncertainty, 
which can be beneficial to the decision making procedure [24, 34]. However, 
kriging models are generally not suitable when the number of inputs is over about 
20 because of the computational expense associated with a matrix inversion step 
needed to calculate the surrogate parameters [32]. The computational expense of this 
step becomes prohibitive as the number of samples grows, and more samples are typically required 
when the number of inputs increases. In such cases, surrogate 
forms suited to high-dimensional inputs, such as ANNs, become attractive, if only for their computational 
advantages. 

The nature of the output of the process model component being approximated 
influences the surrogate model selection. Interpolating surrogate models are often 
desirable for deterministic outputs because the predicted surrogate response is 
guaranteed to be exact at sample points. Kriging is an example of an interpolating 
surrogate model [34]. When the output is stochastic then an interpolating surrogate 
is less suitable, and surrogates that aim to minimize an error metric across all 
samples are preferred. The most common surrogate models are developed for mod- 
eling continuous functions. In process systems applications it can be necessary to 
approximate relationships with discrete outputs [35]. In these cases, surrogates such 
as ensemble tree models, support vector classification machines, or classification 
neural networks might be considered [35-37]. 

When samples from the true process model are very expensive-to-evaluate, then 
a more careful method of interrogating the process model is needed. In the previous 
subsection we discussed methods for adaptive sampling that balance exploitation 
and exploration using an estimate of the prediction variance. To employ these 
methods, it is often desirable to use Gaussian process models that not only provide 
prediction but also quantify the prediction error distribution. The prediction error 
distribution can be used to formulate an acquisition function that can be searched 
for new sample points that balance exploitation and exploration. Surrogate models 
that provide poor predictions when only a few samples are available are not applicable 
in cases where the sampling budget is very low. For instance, polynomial surrogates 
are known to be prone to overfitting when few samples are provided [32]. 

The challenge of selecting an appropriate surrogate model is well recognized 
in the literature. Some researchers have attempted to address this challenge by 
developing automatic surrogate model selection algorithms. Concurrent Surrogate 
Model Selection (COSMOS) concurrently selects the surrogate form, a fitness 
metric that is independent and robust to the surrogate form, and searches through 
parameters of the surrogate to minimize the fitness metric selected [38]. Two 
procedures for concurrently selecting the surrogate properties are used: a cascading 
technique that performs selections in a nested iterative approach, and a one-step 
technique that formulates all three selections in a single MINLP. The MINLP in the 
one-step technique is solved using a genetic algorithm [38]. 

Automatic learning of algebraic models for optimization (ALAMO) is another 
surrogate selection tool that determines the best subset of basis functions to 
construct the surrogate form [30]. It does so by modeling a superstructure surrogate 
form through binary variables, and iteratively solves an optimization problem that 
minimizes a fitness metric over a set of samples that is adaptively expanded until 
certain error tolerances have been met [30]. The approach has been expanded 
to allow the use of first-principles knowledge in the surrogate selection process. 
This is achieved by including continuous constraints on the predicted output 
directly in the regression problem, such as non-negativity. The constraints on 
the surrogate model output are enforced over the entire input domain, not just 
at sample points, thereby making the surrogate model fitting problem a semi- 
infinite programming problem. This can then be solved using classic semi-infinite 
programming techniques. The authors demonstrated the advantages of directly 
incorporating first-principles knowledge in the surrogate selection process. They 
do so by showing that consistently fewer samples are needed to generate surrogate 
models with the same prediction accuracy, when first-principles knowledge is 
incorporated in the surrogate selection process [22]. 


4 Surrogate Modeling Applications in Process Systems 
Optimization 


Henao and Maravelias [9] introduce a systematic approach to applying surrogate 
modeling to optimization-based superstructure methods for process synthesis. First, 
they introduce a systematic method of performing unit model variable analysis so 
that compact and accurate surrogate structures can be identified, where compact 
means considering as few of the original model variables as possible. Once the 
surrogate structures are identified, the selection of surrogate types to model the 
surrogate structures is performed. They demonstrate the use of ANN surrogate 
models to approximate the relationship of a continuous stirred tank reactor. They 
propose a non-iterative SBO approach that uses Latin Hypercube sampling to 
generate the training data for the surrogate model [9]. 

Caballero and Grossmann [10] also introduce an algorithm for process synthesis, 
or what they refer to as modular flow-sheet optimization. Their algorithm replaces 
the unit operations that have noise, or are very time consuming to evaluate, with 
Kriging surrogate models. Once the Kriging surrogate model is optimized, the 
solution is used to determine a nearby region to generate new samples and discard 
previous samples. The nearby region continuously shifts and contracts over the 
course of the algorithm, thereby continuously improving the accuracy of the Kriging 
surrogate within the region. This approach is known to converge to a local optimum 
of the original process model [10]. 

Fahmi and Cremaschi [39] apply a surrogate-based method to the process 
synthesis of a biodiesel production plant. Similar to Henao and Maravelias [9], 
they also used ANNs to model unit operations that were described through first- 
principles simulators. The design of experiments employed was simply to randomly 
sample each input variable using a uniform random distribution, and there was no 
iterative component to their SBO procedure. Despite this, they demonstrated that 
their SBO procedure is able to determine optimal process flow-sheets that are of 
high quality, and in solution times that were far less than directly considering the 
unit operation simulators. They highlighted the advantage of using ANN surrogate 
models as the flexibility to accurately model any unit operation relationship. They 
identified the selection of the correct architecture as a key difficulty when using 
ANN surrogate models [39]. Selecting the ANN architecture is often done using 
rules of thumb, which do not work well for all applications. 

Quirante et al. [40] use surrogate models to optimize the design of distillation 
columns. Optimizing the design of distillation columns is a challenging problem 
due to the continuous decision variables describing the operating conditions, 
discrete decisions relating to the number of trays in each column section, and the 
complex mechanistic relationships describing distillation behavior. Quirante et al. 
[40] replace the rigorous simulators describing the distillation column behavior with 
Kriging models, and employ these Kriging models in an SBO procedure. The SBO 
procedure is pure exploitation, iteratively solving the MINLP with the Kriging 
surrogate, assessing the accuracy of the Kriging surrogate at the optimal solution, 
and retraining the Kriging model if it is found to be inaccurate at the optimal 
solution. In simple case studies, the Kriging surrogate could be solved to global 
optimality using BARON [41, 42]. For the case studies exploring more difficult 
design problems, the Kriging surrogate could only be solved to local optimality. 
In all cases, the SBO procedure converged within reasonable solution times to 
solutions that were validated to be accurate on the true process model [40]. 

Golzari et al. [43] consider the optimal operation of oil and gas field production. 
They seek to maximize the net present value of oil and gas extracted over the 
lifetime of the oil and gas reservoirs. In this problem, the mechanistic relationships 
governing the oil and gas reservoirs are described using a high-fidelity simulator. 
The simulator can be made up of thousands or millions of process nodes, thereby 
taking several hours for a single evaluation. They propose using an ANN surrogate 
model to approximate the reservoir simulator. They demonstrate the advantage of a 
sequential design of experiments approach that adaptively samples points to explore 
the search space based on the prediction error of the ANN, versus a simple 
single-iteration space-filling approach. They employ the resulting ANN surrogate model in 
an iterative SBO procedure with pure exploitation. This means iteratively optimizing 
the surrogate model using a genetic algorithm, and adding the true value of the 
reservoir simulator at the optimal solution to the ANN training data. In a case study 
on a realistic oil and gas field production optimization problem, they compare this 
SBO procedure with a genetic algorithm applied directly on the reservoir simulator. 
They show that the SBO procedure yields higher quality solutions and requires substantially 
fewer simulator evaluations compared with applying the genetic algorithm on the 
reservoir simulator directly [43]. 

Shi and You [13] consider the optimal integrated scheduling and dynamic 
operation of sequential batch processes [44]. The batch scheduling problem is a 
mixed-integer optimization problem, so the integrated dynamic operation problem 
is a mixed-integer dynamic optimization (MIDO) problem. Using the orthogonal 
collocation method, the MIDO problem can be reformulated into a high dimensional 
MINLP. The integrated optimization problem has a bilevel structure where the 
scheduling and dynamic optimization problems are only connected by processing 
costs and times. As a result, the dynamic optimization problems can be replaced 
with piecewise-linear surrogate models, which only capture the main features of the 
dynamic optimization problems. To ensure the piecewise-linear surrogate models 
accurately describe the dynamic optimization problems, an iterative SBO procedure 
is suggested whereby the surrogate model is adaptively updated when the optimal 
solution is sufficiently far from existing sample points. A case study is provided 
that demonstrates that the adaptive SBO procedure rapidly (<1 min) returns higher 
quality solutions than those found by directly solving the full-space MINLP with 
BARON after 50 h [44]. 

Burlacu et al. [29] develop a method of solving MINLPs by adaptively refining 
MILPs, which can be viewed as an iterative SBO procedure [29]. All nonlinearity 
in the MINLP is replaced with integer-linear nonlinearity by using piecewise-linear 
approximations. Their approach focuses on nonlinear functions of two variables. An 
initial sample of the input space of each nonlinear function is taken, and the sample 
points are triangulated to determine vertices of simplicial regions. The piecewise- 
linear functions are then formed by linearly interpolating the output values at 
the vertices of each simplex. The approximation error for each simplex of each 
piecewise-linear approximation can be bounded, which allows for the formulation 
of an MILP that is a relaxation of the original MINLP. The MILP relaxation is 
solved, and if the solution is outside a feasibility tolerance of the original MINLP, 
new triangulations are generated. The new triangulations are generated in a way that 
ensures the approximation error is reduced at the simplex that possesses the optimal 
solution, for each piecewise-linear approximation. By repeating this procedure 
iteratively, the MILP relaxation tightens and convergence to the globally optimal 
solution to the original MINLP is guaranteed. A case study is performed on a large- 
scale natural gas transmission network. It is shown that the SBO procedure is able 
to determine the globally optimal solution within several hours, whereas most 
state-of-the-art MINLP solvers were unable to determine feasible solutions within 10 h 
[29]. 


5 Summary and Future Work 


High dimensional optimization problems arise frequently in process systems engi- 
neering. This is due to the complex mechanistic relationships governing process 
behavior, and/or the large-scale nature of the process system. A proven method 
for solving high dimensional optimization problems is to solve a sequence of 
reduced-dimension subproblems. SBO is one such method for iteratively solving 
reduced-dimension subproblems. The mechanistic relationships governing the pro- 
cess behavior can be described by black-box, gray-box, or white-box models. A 
surrogate model can be used to approximate the key features of the process model, 
thereby reducing the dimension of the model. Iterative SBO procedures are popular 
because the surrogate model can be updated with new samples from the true model, 
helping guide the search for high quality solutions. SBO has proven to be effective 
at finding high quality solutions to many process systems optimization problems. 
Specifically, we reviewed the application of SBO to process synthesis, distillation 
column operation, gas and oil field production operation, the integrated scheduling 
and dynamic operation of sequential batch processes, and natural gas transmission 
operation. 

SBO is a necessary tool for solving black-box and gray-box optimization problems. The 
surrogate model provides analytical expressions with the gradient information that 
is needed for optimization. However, we have shown that SBO can also be useful for 
solving white-box optimization problems. Despite gradient information 
being available for white-box process models, SBO is still beneficial for its ability 
to formulate a sequence of reduced-dimension subproblems that converge to a high 
quality solution to the original problem. In the case of Burlacu et al. [29], their SBO 
procedure is proven to converge to the global optimum of the original white-box 
problem, when certain conditions are satisfied. Their SBO procedure achieves this 
by being able to: (1) bound the approximation error of the surrogate model, and (2) 
iteratively reduce the approximation error around the current solution. With these 
two ingredients, SBO procedures are able to: as a result of (1), formulate reduced- 
dimension subproblems that are relaxations of the original problem, and as a result 
of (2), iteratively tighten the relaxation to ensure convergence to a globally optimal 
solution to the original problem. Future work should explore methods of achieving 
the two ingredients listed above on a wider class of high dimensional optimization 
problems. Doing so would allow SBO procedures with global optimality guarantees 
to be more widely applicable. 
A Viscosity Iterative Method
with Alternated Inertial Terms
for Solving the Split Feasibility Problem


Lulu Liu, Qiao-Li Dong, Shen Wang, and Michael Th. Rassias 


Abstract In this paper, we propose a viscosity iterative algorithm with an alternated
inertial extrapolation step and a self-adaptive stepsize to solve the split feasibility
problem. Under appropriate conditions, the proposed algorithm is
proved to converge to a solution of the split feasibility problem, which is also the 
unique solution of a variational inequality problem. Finally, we demonstrate the 
effectiveness of the algorithm by a numerical example. 


1 Introduction 


This paper studies the split feasibility problem (SFP for short), which takes
the following form:


$$\text{find } x^* \in C \text{ such that } Ax^* \in Q, \tag{1}$$


where $C$ and $Q$ are nonempty closed convex subsets of $\mathbb{R}^N$ and $\mathbb{R}^M$, respectively,
and $A : \mathbb{R}^N \to \mathbb{R}^M$ is a bounded linear operator. The solution set of the
SFP (1) is denoted by $\Gamma$, i.e.,

$$\Gamma = \{x^* \mid x^* \in C \text{ and } Ax^* \in Q\}.$$


Throughout this paper, we assume that $\Gamma$ is nonempty.

The SFP was introduced in 1994 by Censor and Elfving [4] for modeling inverse 
problems which arise from phase retrievals and medical image reconstruction [2]. 
Recently, it has been extended to model the systems biology [19] and electricity 
production [22, 23]. 

Many iterative algorithms and results for split feasibility problems have been
proposed; see, e.g., [6, 8-10, 16-18, 21]. Among them, the CQ
algorithm proposed by Byrne [2, 3] is the most popular one.

In [15], Polyak first introduced an inertial extrapolation method as an accelera- 
tion process. Some authors have combined the CQ algorithm with inertial technique 
(see, e.g., [6, 11]). Notice that the inertial term may destroy the monotonicity of
$\{\|x^k - x^*\|\}_{k\in\mathbb{N}}$ for the iterative sequence $\{x^k\}_{k\in\mathbb{N}}$ and any $x^* \in \Gamma$, which can
slow the convergence of the inertial algorithm. In order to improve this situation,
Mu and Peng [14] proposed to use alternated inertial technique. 

Very recently, by combining alternated inertial technique and the projection 
method in [9], the authors [7] proposed an alternated inertial projection method 
with self-adaptive stepsize to solve the SFP. Let $F(x) = A^T(I - P_Q)A(x)$. The
formula of the method in [7] is given as follows: 

In 2000, Moudafi [13] proposed a viscosity algorithm and proved that the 
algorithm converged to the fixed point of the nonexpansive mapping. In 2004, 
Xu [20] further proved that this limit point is the unique solution of a variational 
inequality. By combining Algorithm 1 and the viscosity method, we propose a 
new viscosity type algorithm to solve the SFP. We establish the convergence of 
the iterative algorithm under some simple assumptions. 

The paper is organized as follows: In Sect. 2, we give some definitions and
lemmas needed in the subsequent analysis. In Sect. 3, we propose a viscosity type algorithm
and prove its convergence under mild conditions. In Sect. 4, we give a numerical
example and show the advantages of the proposed algorithm.


2 Preliminaries 


In this section, we review some definitions and lemmas which are used in the main 
results. 
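The following identity will be used for the main results (see [1, Corollary 2.15]):


$$\|\alpha x + \beta y\|^2 = \alpha(\alpha + \beta)\|x\|^2 + \beta(\alpha + \beta)\|y\|^2 - \alpha\beta\|x - y\|^2, \tag{2}$$


for all $\alpha, \beta \in \mathbb{R}$ and $x, y \in \mathbb{R}^N$.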
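Identity (2) is easy to sanity-check numerically. The following is a minimal sketch using only the Python standard library; the helper names `norm_sq` and `identity_gap` are illustrative and not part of the paper.

```python
import random

def norm_sq(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(t * t for t in v)

def identity_gap(alpha, beta, x, y):
    """Absolute difference between the two sides of identity (2)."""
    lhs = norm_sq([alpha * a + beta * b for a, b in zip(x, y)])
    diff = [a - b for a, b in zip(x, y)]
    rhs = (alpha * (alpha + beta) * norm_sq(x)
           + beta * (alpha + beta) * norm_sq(y)
           - alpha * beta * norm_sq(diff))
    return abs(lhs - rhs)

# Try many random scalar pairs on fixed random vectors.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(5)]
y = [rng.uniform(-1.0, 1.0) for _ in range(5)]
max_gap = max(identity_gap(rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0), x, y)
              for _ in range(100))
print(max_gap)  # only floating-point round-off should remain
```

In the proofs below the identity is applied with $\alpha + \beta = 1$, in which case it reduces to the familiar convex-combination formula.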
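Algorithm 1 Alternated inertial projection method with adaptive stepsize


For any $\mu \in (0, 1)$, $\lambda_1 > 0$ and $-1 < \alpha_k < 0$.
Let $x^0, x^1 \in \mathbb{R}^N$ be given starting points. Set $k := 1$.


Compute

$$w^k = \begin{cases} x^k, & k \text{ even}, \\ x^k + \alpha_k (x^k - x^{k-1}), & k \text{ odd}, \end{cases}$$

and

$$y^k = P_C(w^k - \lambda_k F(w^k)).$$

Calculate

$$\lambda_{k+1} = \begin{cases} \min\left\{ \dfrac{\mu \|w^k - y^k\|}{\|F(w^k) - F(y^k)\|}, \lambda_k \right\}, & \text{if } F(w^k) - F(y^k) \neq 0, \\ \lambda_k, & \text{otherwise}. \end{cases}$$

Compute


$$x^{k+1} = P_C(w^k - \gamma \sigma_k \lambda_k F(y^k)),$$


where $\gamma \in (0, 2)$,


$$\sigma_k = \frac{\langle w^k - y^k, d(w^k, y^k) \rangle + \lambda_k \|(I - P_Q) A(y^k)\|^2}{\|d(w^k, y^k)\|^2}$$


and


$$d(w^k, y^k) := (w^k - y^k) - \lambda_k (F(w^k) - F(y^k)).$$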


The projection is an important tool for our work in this paper. Let $K$ be a closed
convex subset of $\mathbb{R}^N$. Recall that the nearest point or metric projection from $\mathbb{R}^N$
onto $K$, which is denoted $P_K$, is defined as follows: for each $x \in \mathbb{R}^N$, $P_K(x)$ is the
unique point in $K$ such that


$$\|x - P_K(x)\| = \min\{\|x - z\| : z \in K\}.$$


The following two lemmas are useful characterizations and properties of projec- 
tions: 


Lemma 1 For any $x \in \mathbb{R}^N$ and $z \in K$, $z = P_K(x)$ if and only if


$$\langle x - z, y - z \rangle \le 0, \quad \forall y \in K.$$

Lemma 2 For any $x, y \in \mathbb{R}^N$ and $z \in K$, the following hold:


(i) $\|P_K(x) - P_K(y)\|^2 \le \langle P_K(x) - P_K(y), x - y \rangle$;
(ii) $\|P_K(x) - z\|^2 \le \|x - z\|^2 - \|P_K(x) - x\|^2$;
(iii) $\langle (I - P_K)x - (I - P_K)y, x - y \rangle \ge \|(I - P_K)x - (I - P_K)y\|^2$.
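For the two kinds of sets used later in the numerical example (a ball and a box), the metric projection has a closed form. The sketch below, standard library only, implements both and checks the characterization of Lemma 1 on sampled points of a ball. The function names and the sampling scheme are illustrative assumptions, not taken from the paper.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_ball(x, center, radius):
    """Metric projection onto the closed ball {z : ||z - center|| <= radius}."""
    diff = [a - c for a, c in zip(x, center)]
    dist = math.sqrt(dot(diff, diff))
    if dist <= radius:
        return list(x)
    return [c + radius * t / dist for c, t in zip(center, diff)]

def project_box(x, lower, upper):
    """Metric projection onto the box {z : lower <= z <= upper} (componentwise clamp)."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(x, lower, upper)]

def lemma1_max(x, center, radius, trials=200, seed=0):
    """Largest sampled value of <x - z, y - z> with z = P_K(x) and y in the ball K."""
    rng = random.Random(seed)
    z = project_ball(x, center, radius)
    xz = [a - b for a, b in zip(x, z)]
    worst = -math.inf
    for _ in range(trials):
        raw = [rng.uniform(-5.0, 5.0) for _ in x]
        y = project_ball(raw, center, radius)  # projecting guarantees y lies in K
        worst = max(worst, dot(xz, [a - b for a, b in zip(y, z)]))
    return worst

x = [4.0, -3.0, 1.0]
origin = [0.0, 0.0, 0.0]
print(project_ball(x, origin, 2.0))
print(project_box(x, [0.0] * 3, [1.0] * 3))
print(lemma1_max(x, origin, 2.0) <= 1e-9)  # Lemma 1: the inner product is never positive
```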


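Definition 1 Let $F$ be a mapping from a set $C \subseteq \mathbb{R}^N$ into $\mathbb{R}^N$. Then we have the
following:


(i) $F$ is said to be monotone on $C$ if
$\langle F(x) - F(y), x - y \rangle \ge 0$, $\forall x, y \in C$;
(ii) $F$ is said to be $\alpha$-inverse strongly monotone ($\alpha$-ism for short) with $\alpha > 0$ if
$\langle F(x) - F(y), x - y \rangle \ge \alpha \|F(x) - F(y)\|^2$, $\forall x, y \in C$;
(iii) $F$ is said to be Lipschitz continuous on $C$ with constant $L > 0$ if
$\|F(x) - F(y)\| \le L \|x - y\|$, $\forall x, y \in C$.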


Definition 2 Let $f$ be a mapping from a set $C \subseteq \mathbb{R}^N$ into $\mathbb{R}^N$. If there exists a
constant $\kappa \in (0, 1)$ such that


$$\|f(x) - f(y)\| \le \kappa \|x - y\|, \quad \forall x, y \in C,$$


then $f$ is said to be a contraction.

Lemma 3 ([5]) Let $F = A^T(I - P_Q)A$. Then


(i) $F$ is Lipschitz continuous with $L = \|A\|^2$;


(ii) $F$ is $\frac{1}{L}$-inverse strongly monotone.


3 Main Results 


In this section, we propose a viscosity iterative method with alternated inertial term 
and self-adaptive stepsize and prove its convergence under appropriate conditions. 


Let $f : C \to C$ be a contraction with constant $\kappa \in (0, \sqrt{2}/2)$. Now we give the
iterative method:


Assumption 8 We assume that the inertial parameter $\alpha_k$ and the parameter $\beta_k$
satisfy the following conditions:


(i) $-1 + \epsilon < \alpha_k < -\epsilon$, $\epsilon > 0$;


(ii) $\lim_{k\to\infty} \beta_k = 0$ and $\sum_{k=1}^{\infty} \beta_k = \infty$.
Algorithm 2 A viscosity iterative algorithm with alternated inertial terms


Step 0 Choose the parameters $\alpha_k$ satisfying Assumption 8, $\mu \in (0, 1)$ and
$\lambda_1 > 0$. Let $x^0, x^1 \in \mathbb{R}^N$ be given starting points. Set $k := 1$.


Step 1 Compute


$$w^k = \begin{cases} x^k, & k \text{ even}, \\ x^k + \alpha_k (x^k - x^{k-1}), & k \text{ odd}, \end{cases}$$

and

$$y^k = P_C(w^k - \lambda_k F(w^k)).$$

Calculate


$$\lambda_{k+1} = \begin{cases} \min\left\{ \dfrac{\mu \|w^k - y^k\|}{\|F(w^k) - F(y^k)\|}, \lambda_k \right\}, & \text{if } F(w^k) - F(y^k) \neq 0, \\ \lambda_k, & \text{otherwise}. \end{cases} \tag{3}$$

If $y^k = w^k$, STOP. Otherwise go to Step 2.
Step 2 Compute

$$x^{k+1} = \beta_k f(w^k) + (1 - \beta_k) P_C(w^k - \gamma \sigma_k \lambda_k F(y^k)), \tag{4}$$


where $\beta_k \in [0, 1]$, $\gamma \in (0, 2)$,


$$\sigma_k := \frac{\langle w^k - y^k, d(w^k, y^k) \rangle + \lambda_k \|(I - P_Q) A(y^k)\|^2}{\|d(w^k, y^k)\|^2},$$

and

$$d(w^k, y^k) := (w^k - y^k) - \lambda_k (F(w^k) - F(y^k)).$$


Set $k := k + 1$ and return to Step 1.
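The steps of Algorithm 2 can be sketched in a few dozen lines. The following is an illustrative standard-library Python sketch, not the authors' MATLAB code. The function name `algorithm2`, the tolerance-based version of the test $y^k = w^k$, the default $\beta_k = 1/(k+1)$, and the contraction $f(x) = x/2$ in the usage example are all assumptions made for the illustration; the projections are passed in as callables.

```python
import math

def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Return A^T y for a matrix stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def axpy(a, u, v):
    """Return a*u + v."""
    return [a * s + t for s, t in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def algorithm2(A, proj_C, proj_Q, f, x0, x1, alpha=-0.55, mu=0.45, gamma=1.85,
               lam=0.1, beta=lambda k: 1.0 / (k + 1), max_iter=5000, tol=1e-10):
    """Sketch of Algorithm 2; returns (approximate solution, iterations used)."""
    def residual_Q(v):                      # (I - P_Q) A v
        Av = matvec(A, v)
        return sub(Av, proj_Q(Av))

    x_prev, x = list(x0), list(x1)
    for k in range(1, max_iter + 1):
        # Step 1: inertial extrapolation on odd iterations only.
        w = x if k % 2 == 0 else axpy(alpha, sub(x, x_prev), x)
        Fw = rmatvec(A, residual_Q(w))      # F(w) = A^T (I - P_Q) A w
        y = proj_C(axpy(-lam, Fw, w))
        wy = sub(w, y)
        if norm(wy) <= tol:                 # y^k = w^k up to the tolerance: stop
            return y, k
        r = residual_Q(y)
        Fy = rmatvec(A, r)
        dF = sub(Fw, Fy)
        # Step 2: direction d, step factor sigma_k and viscosity update (4).
        d = axpy(-lam, dF, wy)
        dd = dot(d, d)
        if dd == 0.0:                       # degenerate direction: give up gracefully
            return y, k
        sigma = (dot(wy, d) + lam * dot(r, r)) / dd
        u = proj_C(axpy(-gamma * sigma * lam, Fy, w))
        b = beta(k)
        x_prev, x = x, [b * s + (1.0 - b) * t for s, t in zip(f(w), u)]
        # Self-adaptive stepsize (3): lambda_{k+1} never exceeds lambda_k.
        if norm(dF) > 0.0:
            lam = min(mu * norm(wy) / norm(dF), lam)
    return x, max_iter

# Tiny 1-D illustration: C = [-2, 2], Q = [1, 3], A = identity, f(x) = x/2.
clamp = lambda lo, hi: (lambda v: [min(max(t, lo), hi) for t in v])
sol, iters = algorithm2([[1.0]], clamp(-2.0, 2.0), clamp(1.0, 3.0),
                        lambda v: [0.5 * t for t in v], [5.0], [5.0])
print(sol, iters)
```

In this toy run the method stops as soon as $w^k$ lands in $C$ with $Aw^k \in Q$, which is exactly the STOP test of Step 1; it returns a feasible point rather than iterating further towards the solution of the variational inequality.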


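Remark 1 ([7, Remark 3.3]) Note that by (3), $\lambda_{k+1} \le \lambda_k$, $\forall k \ge 1$. Also, observe
in Algorithm 2 that if $F(w^k) \ne F(y^k)$, then


$$\frac{\mu \|w^k - y^k\|}{\|F(w^k) - F(y^k)\|} \ge \frac{\mu \|w^k - y^k\|}{\|A\|^2 \|w^k - y^k\|} = \frac{\mu}{\|A\|^2},$$


which implies that $0 < \min\{\lambda_1, \mu / \|A\|^2\} \le \lambda_k$, $\forall k \ge 1$. This means that $\lim_{k\to\infty} \lambda_k$
exists. Thus, there exists $\lambda > 0$ such that $\lim_{k\to\infty} \lambda_k = \lambda$.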


Lemma 4 ([7, Lemma 3.4]) Let $\{x^k\}_{k\in\mathbb{N}}$ be generated by Algorithm 2. Then there
exist a constant $\delta \in (\mu, 1)$ and a positive integer $K_0$ such that we have, for any
$k \ge K_0$,


$$\langle w^k - y^k, d(w^k, y^k) \rangle \ge (1 - \delta) \|w^k - y^k\|^2,$$

and


$$\sigma_k \ge \frac{1 - \delta}{(1 + \delta)^2}.$$


Lemma 5 ([7, Lemma 3.5]) Let $\{x^k\}_{k\in\mathbb{N}}$ be generated by Algorithm 2. Then we
have


$$\langle w^k - x^*, d(w^k, y^k) \rangle \ge \sigma_k \|d(w^k, y^k)\|^2, \quad \forall x^* \in \Gamma.$$


Lemma 6 Suppose that Assumption 8 (i) holds. Then the sequence $\{x^k\}_{k\in\mathbb{N}}$ generated
by Algorithm 2 is bounded.


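Proof For convenience, set

$$u^k = P_C(w^k - \gamma \sigma_k \lambda_k F(y^k)),$$

then the iterative scheme (4) can be rewritten as follows:

$$x^{k+1} = \beta_k f(w^k) + (1 - \beta_k) u^k. \tag{5}$$


Pick a point $x^*$ in $\Gamma$. By Theorem 3.9 in [7], we have


$$\begin{aligned} \|u^{2k+1} - x^*\|^2 \le{} & \|x^{2k} - x^*\|^2 - (1 + \alpha_{2k+1}) \gamma (2 - \gamma) \frac{(1 - \delta)^2}{(1 + \delta)^2} \|x^{2k} - y^{2k}\|^2 \\ & - \gamma (2 - \gamma) \frac{(1 - \delta)^2}{(1 + \delta)^2} \|w^{2k+1} - y^{2k+1}\|^2 \\ & + \alpha_{2k+1} (1 + \alpha_{2k+1}) \|u^{2k} - x^{2k}\|^2, \end{aligned} \tag{6}$$

and

$$\|u^{2k} - x^*\|^2 \le \|x^{2k} - x^*\|^2. \tag{7}$$


By Assumption 8 (i) and (6), we get


$$\|u^{2k+1} - x^*\|^2 \le \|x^{2k} - x^*\|^2. \tag{8}$$

From (5), (8) and the contraction of $f$, it follows

$$\begin{aligned} \|x^{2k+2} - x^*\|^2 &= \|\beta_{2k+1} (f(w^{2k+1}) - x^*) + (1 - \beta_{2k+1})(u^{2k+1} - x^*)\|^2 \\ &\le \beta_{2k+1} \|f(w^{2k+1}) - x^*\|^2 + (1 - \beta_{2k+1}) \|u^{2k+1} - x^*\|^2 \\ &\le \beta_{2k+1} \left[ 2\|f(w^{2k+1}) - f(x^*)\|^2 + 2\|f(x^*) - x^*\|^2 \right] + (1 - \beta_{2k+1}) \|x^{2k} - x^*\|^2 \\ &\le 2\beta_{2k+1} \kappa^2 \|w^{2k+1} - x^*\|^2 + 2\beta_{2k+1} \|f(x^*) - x^*\|^2 + (1 - \beta_{2k+1}) \|x^{2k} - x^*\|^2, \end{aligned} \tag{9}$$


where the first inequality comes from (2). In addition, by the definition of $w^k$ and
(2),


$$\begin{aligned} \|w^{2k+1} - x^*\|^2 &= \|x^{2k+1} + \alpha_{2k+1} (x^{2k+1} - x^{2k}) - x^*\|^2 \\ &= \|(1 + \alpha_{2k+1})(x^{2k+1} - x^*) - \alpha_{2k+1} (x^{2k} - x^*)\|^2 \\ &\le (1 + \alpha_{2k+1}) \|x^{2k+1} - x^*\|^2 - \alpha_{2k+1} \|x^{2k} - x^*\|^2. \end{aligned} \tag{10}$$


Similarly, by (7), (9), $w^{2k} = x^{2k}$ and $\kappa \in (0, \sqrt{2}/2)$, we get


$$\begin{aligned} \|x^{2k+1} - x^*\|^2 &\le 2\beta_{2k} \kappa^2 \|x^{2k} - x^*\|^2 + 2\beta_{2k} \|f(x^*) - x^*\|^2 + (1 - \beta_{2k}) \|x^{2k} - x^*\|^2 \\ &= [1 - \beta_{2k}(1 - 2\kappa^2)] \|x^{2k} - x^*\|^2 + 2\beta_{2k} \|f(x^*) - x^*\|^2 \\ &\le \|x^{2k} - x^*\|^2 + 2\beta_{2k} \|f(x^*) - x^*\|^2. \end{aligned} \tag{11}$$

Putting (11) into (10), we get


$$\|w^{2k+1} - x^*\|^2 \le \|x^{2k} - x^*\|^2 + 2(1 + \alpha_{2k+1}) \beta_{2k} \|f(x^*) - x^*\|^2. \tag{12}$$


Combining (9) and (12), we have


$$\begin{aligned} \|x^{2k+2} - x^*\|^2 \le{} & 2\beta_{2k+1} \kappa^2 \|x^{2k} - x^*\|^2 + 4\beta_{2k} \beta_{2k+1} \kappa^2 (1 + \alpha_{2k+1}) \|f(x^*) - x^*\|^2 \\ & + 2\beta_{2k+1} \|f(x^*) - x^*\|^2 + (1 - \beta_{2k+1}) \|x^{2k} - x^*\|^2 \\ ={} & [1 - \beta_{2k+1}(1 - 2\kappa^2)] \|x^{2k} - x^*\|^2 \\ & + 2\beta_{2k+1} [1 + 2\beta_{2k} \kappa^2 (1 + \alpha_{2k+1})] \|f(x^*) - x^*\|^2 \\ \le{} & [1 - \beta_{2k+1}(1 - 2\kappa^2)] \|x^{2k} - x^*\|^2 + 2\beta_{2k+1} (1 + 2\kappa^2) \|f(x^*) - x^*\|^2 \\ \le{} & \max\left\{ \|x^{2k} - x^*\|^2, \frac{2(1 + 2\kappa^2)}{1 - 2\kappa^2} \|f(x^*) - x^*\|^2 \right\}. \end{aligned}$$

Therefore,


$$\|x^{2k+2} - x^*\|^2 \le \max\left\{ \|x^0 - x^*\|^2, \frac{2(1 + 2\kappa^2)}{1 - 2\kappa^2} \|f(x^*) - x^*\|^2 \right\}.$$


Then, $\{x^{2k}\}_{k\in\mathbb{N}}$ is bounded and hence $\{f(x^{2k})\}_{k\in\mathbb{N}}$ is bounded. By (11), $\{x^{2k+1}\}_{k\in\mathbb{N}}$ is
bounded as well, so $\{x^k\}_{k\in\mathbb{N}}$ is bounded.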


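Lemma 7 Suppose that Assumption 8 holds. Let the sequence $\{x^k\}_{k\in\mathbb{N}}$ be generated by
Algorithm 2. If


$$\begin{aligned} \lim_{k\to\infty} \Big[ & (1 + \alpha_{2k+1}) \gamma (2 - \gamma) \frac{(1 - \delta)^2}{(1 + \delta)^2} \|x^{2k} - y^{2k}\|^2 + \gamma (2 - \gamma) \frac{(1 - \delta)^2}{(1 + \delta)^2} \|w^{2k+1} - y^{2k+1}\|^2 \\ & - \alpha_{2k+1} (1 + \alpha_{2k+1}) \|u^{2k} - x^{2k}\|^2 \Big] = 0, \end{aligned} \tag{13}$$

then

$$\lim_{k\to\infty} \|F(y^k)\| = 0, \tag{14}$$

$$\lim_{k\to\infty} \|x^k - y^k\| = 0, \tag{15}$$

$$\lim_{k\to\infty} \|y^k - P_C(y^k)\| = 0, \tag{16}$$

and

$$\lim_{k\to\infty} \|Ay^k - P_Q(Ay^k)\| = 0. \tag{17}$$

Moreover, $\omega(x^{2k}) \subset \Gamma$.

Proof By Assumption 8 (i) and (13), we get

$$\lim_{k\to\infty} \|x^{2k} - y^{2k}\| = 0, \tag{18}$$

$$\lim_{k\to\infty} \|w^{2k+1} - y^{2k+1}\| = 0, \tag{19}$$

and

$$\lim_{k\to\infty} \|u^{2k} - x^{2k}\| = 0. \tag{20}$$

From (5), it follows


$$\begin{aligned} \|x^{2k+1} - x^{2k}\| &= \|\beta_{2k} (f(x^{2k}) - x^{2k}) + (1 - \beta_{2k})(u^{2k} - x^{2k})\| \\ &\le \beta_{2k} \|f(x^{2k}) - x^{2k}\| + (1 - \beta_{2k}) \|u^{2k} - x^{2k}\|, \end{aligned}$$

which with (20) and the assumption on $\beta_k$ implies


$$\lim_{k\to\infty} \|x^{2k+1} - x^{2k}\| = 0. \tag{21}$$

Combining (18), (19), and (21), and following the proof of Lemma 3.7 and Theorem
3.8 in [7], we can get (14), (15), (16), and (17).

Next, we show $\omega(x^{2k}) \subset \Gamma$. Since $\{x^{2k}\}_{k\in\mathbb{N}}$ is bounded, we can arbitrarily take
$\hat{x} \in \omega(x^{2k})$ and let $\{x^{2k_l}\}_{l\in\mathbb{N}}$ be a subsequence of $\{x^{2k}\}_{k\in\mathbb{N}}$ converging to $\hat{x}$. From
(15), it follows that $\{y^{2k_l}\}_{l\in\mathbb{N}}$ also converges to $\hat{x}$. From (16) and (17), we have
$\hat{x} \in C$ and $A\hat{x} \in Q$. Therefore, $\hat{x} \in \Gamma$.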


In the proof of the main results, we also need the following technical lemma from 
[12]. 


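Lemma 8 Assume $\{s^k\}_{k\in\mathbb{N}}$ is a sequence of nonnegative real numbers such that


$$s^{k+1} \le (1 - \theta_k) s^k + \theta_k \delta_k, \quad k \ge 0,$$


$$s^{k+1} \le s^k - \eta^k + \sigma^k, \quad k \ge 0,$$


where $\{\theta_k\}_{k\in\mathbb{N}}$ is a sequence in $(0, 1)$, $\{\eta^k\}_{k\in\mathbb{N}}$ is a sequence of nonnegative real
numbers, and $\{\delta_k\}_{k\in\mathbb{N}}$ and $\{\sigma^k\}_{k\in\mathbb{N}}$ are two sequences in $\mathbb{R}$ such that


(i) $\sum_{k=0}^{\infty} \theta_k = \infty$,
(ii) $\lim_{k\to\infty} \sigma^k = 0$,
(iii) $\lim_{l\to\infty} \eta^{k_l} = 0$ implies $\limsup_{l\to\infty} \delta_{k_l} \le 0$ for any subsequence $\{k_l\}_{l\in\mathbb{N}} \subset \{k\}_{k\in\mathbb{N}}$.


Then $\lim_{k\to\infty} s^k = 0$.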


Theorem 1 Suppose that Assumption 8 holds and $\kappa \in (0, \sqrt{2}/2)$. Then the sequence
$\{x^k\}_{k\in\mathbb{N}}$ generated by Algorithm 2 converges to the solution $x^*$ of the SFP (1) which
solves the variational inequality problem:


$$\langle (I - f)x^*, x - x^* \rangle \ge 0, \quad \forall x \in \Gamma. \tag{22}$$


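Proof Let $x^* \in \Gamma$ satisfy the variational inequality problem (22). On the one
hand, from (5), (8) and (12), we have


$$\begin{aligned} \|x^{2k+2} - x^*\|^2 ={} & \|\beta_{2k+1} (f(w^{2k+1}) - x^*) + (1 - \beta_{2k+1})(u^{2k+1} - x^*)\|^2 \\ ={} & \beta_{2k+1}^2 \|f(w^{2k+1}) - x^*\|^2 + (1 - \beta_{2k+1})^2 \|u^{2k+1} - x^*\|^2 \\ & + 2\beta_{2k+1} (1 - \beta_{2k+1}) \langle f(w^{2k+1}) - x^*, u^{2k+1} - x^* \rangle \\ ={} & \beta_{2k+1}^2 \|f(w^{2k+1}) - x^*\|^2 + (1 - \beta_{2k+1})^2 \|u^{2k+1} - x^*\|^2 \\ & + 2\beta_{2k+1} (1 - \beta_{2k+1}) \langle f(w^{2k+1}) - f(x^*), u^{2k+1} - x^* \rangle \\ & + 2\beta_{2k+1} (1 - \beta_{2k+1}) \langle f(x^*) - x^*, u^{2k+1} - x^* \rangle \\ \le{} & (1 - \beta_{2k+1})^2 \|u^{2k+1} - x^*\|^2 + \beta_{2k+1} (1 - \beta_{2k+1}) \left( \kappa^2 \|w^{2k+1} - x^*\|^2 + \|u^{2k+1} - x^*\|^2 \right) \\ & + \beta_{2k+1}^2 \|f(w^{2k+1}) - x^*\|^2 + 2\beta_{2k+1} (1 - \beta_{2k+1}) \langle f(x^*) - x^*, u^{2k+1} - x^* \rangle \\ ={} & (1 - \beta_{2k+1}) \|u^{2k+1} - x^*\|^2 + \beta_{2k+1} (1 - \beta_{2k+1}) \kappa^2 \|w^{2k+1} - x^*\|^2 \\ & + \beta_{2k+1}^2 \|f(w^{2k+1}) - x^*\|^2 + 2\beta_{2k+1} (1 - \beta_{2k+1}) \langle f(x^*) - x^*, u^{2k+1} - x^* \rangle \\ \le{} & [1 - \beta_{2k+1} (1 - \kappa^2 (1 - \beta_{2k+1}))] \|x^{2k} - x^*\|^2 \\ & + \beta_{2k+1} \big[ 2(1 - \beta_{2k+1}) \kappa^2 (1 + \alpha_{2k+1}) \beta_{2k} \|f(x^*) - x^*\|^2 \\ & + \beta_{2k+1} \|f(w^{2k+1}) - x^*\|^2 + 2(1 - \beta_{2k+1}) \langle f(x^*) - x^*, u^{2k+1} - x^* \rangle \big] \\ ={} & (1 - \theta_k) \|x^{2k} - x^*\|^2 + \theta_k \delta_k, \end{aligned} \tag{23}$$


where


$$\theta_k = \beta_{2k+1} (1 - \kappa^2 (1 - \beta_{2k+1})),$$


$$\begin{aligned} \delta_k ={} & \frac{2(1 - \beta_{2k+1}) \kappa^2 (1 + \alpha_{2k+1}) \beta_{2k}}{1 - \kappa^2 (1 - \beta_{2k+1})} \|f(x^*) - x^*\|^2 + \frac{\beta_{2k+1}}{1 - \kappa^2 (1 - \beta_{2k+1})} \|f(w^{2k+1}) - x^*\|^2 \\ & + \frac{2(1 - \beta_{2k+1})}{1 - \kappa^2 (1 - \beta_{2k+1})} \langle f(x^*) - x^*, u^{2k+1} - x^* \rangle. \end{aligned}$$


On the other hand, we also have, from (5) and (6),


$$\begin{aligned} \|x^{2k+2} - x^*\|^2 \le{} & (1 - \beta_{2k+1}) \|u^{2k+1} - x^*\|^2 + \beta_{2k+1} \|f(w^{2k+1}) - x^*\|^2 \\ \le{} & \|x^{2k} - x^*\|^2 - (1 + \alpha_{2k+1}) \gamma (2 - \gamma) \frac{(1 - \delta)^2}{(1 + \delta)^2} \|x^{2k} - y^{2k}\|^2 \\ & - \gamma (2 - \gamma) \frac{(1 - \delta)^2}{(1 + \delta)^2} \|w^{2k+1} - y^{2k+1}\|^2 \\ & + \alpha_{2k+1} (1 + \alpha_{2k+1}) \|u^{2k} - x^{2k}\|^2 + \beta_{2k+1} \|f(w^{2k+1}) - x^*\|^2. \end{aligned} \tag{24}$$

Set

$$s^k = \|x^{2k} - x^*\|^2,$$


$$\begin{aligned} \eta^k ={} & (1 + \alpha_{2k+1}) \gamma (2 - \gamma) \frac{(1 - \delta)^2}{(1 + \delta)^2} \|x^{2k} - y^{2k}\|^2 + \gamma (2 - \gamma) \frac{(1 - \delta)^2}{(1 + \delta)^2} \|w^{2k+1} - y^{2k+1}\|^2 \\ & - \alpha_{2k+1} (1 + \alpha_{2k+1}) \|u^{2k} - x^{2k}\|^2, \end{aligned}$$

$$\sigma^k = \beta_{2k+1} \|f(w^{2k+1}) - x^*\|^2.$$


Then, from (23) and (24), we derive the following inequalities:


$$s^{k+1} \le (1 - \theta_k) s^k + \theta_k \delta_k,$$


$$s^{k+1} \le s^k - \eta^k + \sigma^k, \quad k \ge 0.$$


Since $\beta_k$ satisfies Assumption 8 (ii), we obtain $\sum_{k=0}^{\infty} \theta_k = \infty$ and $\lim_{k\to\infty} \sigma^k = 0$.
To use Lemma 8, it suffices to verify that, for any subsequence $\{k_l\}_{l\in\mathbb{N}} \subset \{k\}_{k\in\mathbb{N}}$,


$$\lim_{l\to\infty} \eta^{k_l} = 0$$

implies

$$\limsup_{l\to\infty} \delta_{k_l} \le 0. \tag{25}$$


Since $\lim_{k\to\infty} \beta_k = 0$ and $\lim_{k\to\infty} (1 - \kappa^2 (1 - \beta_{2k+1})) = 1 - \kappa^2$, to get (25), we
only need to verify


$$\limsup_{l\to\infty} \langle f(x^*) - x^*, u^{2k_l+1} - x^* \rangle \le 0.$$


Observe that


$$\begin{aligned} \|u^{2k_l+1} - x^{2k_l}\| &\le \|w^{2k_l+1} - \gamma \sigma_{2k_l+1} \lambda_{2k_l+1} F(y^{2k_l+1}) - x^{2k_l}\| \\ &\le \|w^{2k_l+1} - x^{2k_l}\| + \gamma \sigma_{2k_l+1} \lambda_{2k_l+1} \|F(y^{2k_l+1})\| \\ &= (1 + \alpha_{2k_l+1}) \|x^{2k_l+1} - x^{2k_l}\| + \gamma \sigma_{2k_l+1} \lambda_{2k_l+1} \|F(y^{2k_l+1})\|, \end{aligned}$$

which with (21) and (14) yields


$$\lim_{l\to\infty} \|u^{2k_l+1} - x^{2k_l}\| = 0.$$

Since $x^*$ is the solution of the variational inequality problem (22), we obtain


$$\begin{aligned} \limsup_{l\to\infty} \langle f(x^*) - x^*, u^{2k_l+1} - x^* \rangle &= -\liminf_{l\to\infty} \langle (I - f)x^*, u^{2k_l+1} - x^* \rangle \\ &= -\liminf_{l\to\infty} \langle (I - f)x^*, x^{2k_l} - x^* \rangle \\ &= -\inf \langle (I - f)x^*, \hat{x} - x^* \rangle \le 0, \end{aligned}$$


where $\hat{x} \in \omega(x^{2k_l}) \subset \Gamma$. From Lemma 8, it follows

$$\lim_{k\to\infty} \|x^{2k} - x^*\| = 0.$$


That is, $x^{2k} \to x^*$. By (21), we get $x^{2k+1} \to x^*$. So, the sequence $\{x^k\}_{k\in\mathbb{N}}$
converges in norm to the solution of the SFP (1) which solves the variational
inequality problem (22).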


4 Numerical Results 


In this section, we provide a numerical example to compare Algorithm 2 with the 
algorithm (2.1) in [16]. All codes were written in MATLAB R2016a and run
on a desktop PC with an Intel(R) Pentium(R) CPU N3540 @ 2.16 GHz and
4.00 GB of RAM.

For convenience, we denote the vector with all elements 0 by $e_0$ and the vector
with all elements 1 by $e_1$ in what follows. In the numerical results listed in the
following tables, "Iter." and "CPU time" denote the number of iterations and the CPU
time in seconds, respectively.


Example 1 Consider the SFP, where $A = (a_{ij})_{m \times n} \in \mathbb{R}^{m \times n}$ with $a_{ij} \in (0, 1)$
generated randomly and


$$C = \{x \in \mathbb{R}^n : \|x - d\| \le r\},$$
$$Q = \{y \in \mathbb{R}^m : L \le y \le U\},$$


where $d$ is the center of the ball $C$, $e_0 \le d \le e_1$, $r \in (10, 20)$ is the radius, and $d$
and $r$ are both generated randomly. $L$ and $U$ are the bounds of the box $Q$ and
are also generated randomly, satisfying $10e_1 \le L \le 20e_1$ and $20e_1 \le U \le 30e_1$,
respectively. The initial point $x^0 \in (0, 100e_1)$ is randomly chosen. In the numerical
experiment, we took the objective function value

$$p(x^k) = \frac{1}{2} \|x^k - P_C(x^k)\|^2 + \frac{1}{2} \|Ax^k - P_Q(Ax^k)\|^2 < \epsilon$$

as the stopping criterion, with $\epsilon = 0.05$.
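A random instance of this example and the stopping quantity $p(x)$ can be sketched as follows (standard library only). The names `make_instance` and `p`, the seed handling, and the use of closed-form distances to the ball and the box are illustrative assumptions; the factor 1/2 on both terms follows the reading of the stopping criterion given above.

```python
import math
import random

def make_instance(m, n, seed=0):
    """Draw A, d, r, L, U and x0 in the ranges described in the example."""
    rng = random.Random(seed)
    A = [[rng.random() for _ in range(n)] for _ in range(m)]
    d = [rng.random() for _ in range(n)]            # centre of the ball C
    r = rng.uniform(10.0, 20.0)                     # radius of C
    L = [rng.uniform(10.0, 20.0) for _ in range(m)]  # lower bounds of the box Q
    U = [rng.uniform(20.0, 30.0) for _ in range(m)]  # upper bounds of the box Q
    x0 = [rng.uniform(0.0, 100.0) for _ in range(n)]
    return A, d, r, L, U, x0

def p(x, A, d, r, L, U):
    """0.5*dist(x, C)^2 + 0.5*dist(Ax, Q)^2 for the ball C and the box Q."""
    dist_c = max(0.0, math.sqrt(sum((a - c) ** 2 for a, c in zip(x, d))) - r)
    Ax = [sum(a * t for a, t in zip(row, x)) for row in A]
    dist_q_sq = sum(max(lo - v, v - hi, 0.0) ** 2 for v, lo, hi in zip(Ax, L, U))
    return 0.5 * dist_c ** 2 + 0.5 * dist_q_sq

A, d, r, L, U, x0 = make_instance(4, 3)
print(p(x0, A, d, r, L, U))  # typically far above the threshold 0.05 at the start
```

An iterative solver for the SFP would evaluate `p` at each iterate and stop once the value drops below the chosen threshold.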
[Figure: log-scale plot against the iteration number k (0 to 3000), with curves for Alg 2 and Alg (2.1) in [16]]

Fig. 1 Comparison of Algorithm 2 with the algorithm (2.1) in [16] for Example 1 when m =
500, n = 500


Table 1 Comparison of Algorithm 2 with the algorithm (2.1) in [16] for Example 1


Problem size | Iter.                                 | CPU time (s)
m    | n     | Algorithm 2 | Algorithm (2.1) in [16] | Algorithm 2 | Algorithm (2.1) in [16]
500  | 500   | 43          | 2865                    | 0.0078      | 0.3023
500  | 1000  | 28          | 2348                    | 0.0090      | 0.4587
1000 | 500   | 72          | 8314                    | 0.0205      | 1.3519
1000 | 1000  | 89          | 6179                    | 0.0571      | 1.9624


We take $\mu = 0.45$, $\gamma = 1.85$, $\alpha_k = -0.55$, $\lambda_1 = 0.1$ and $\beta_k = $ … in Algorithm 2
and $\alpha_n = $ …, $\beta_n = 0.1$, $\gamma_n = 1 - \beta_n - \alpha_n$ and … in the algorithm (2.1) of
[16]. Note that we take $f(x) = $ … The corresponding results reported in Fig. 1 and
Table 1 illustrate that Algorithm 2 performs better than the algorithm (2.1) in [16]
in terms of both iteration numbers and CPU time.


Acknowledgments The first and second authors were supported by the Scientific Research Project of
Tianjin Municipal Education Commission (No. 2020ZD02).
Efficient Location-Based Tracking for
IoT Devices Using Compressive Sensing
and Machine Learning Techniques


Ramy Aboushelbaya, Taimir Aguacil, Qiuting Huang, and Peter A. Norreys 


Abstract In this chapter, a scheme based on compressive sensing (CS) for the 
sparse reconstruction of down-sampled location data is presented for the first 
time. The underlying sparsity properties of the location data are explored and two 
algorithms based on LASSO regression and neural networks are shown to be able 
to efficiently reconstruct paths with only ~20% sampling of the GPS receiver. 
An implementation for iOS devices is discussed and results from it are shown as 
proof of concept of the applicability of CS in location-based tracking for Internet of 
Things (IoT) devices. 


1 Introduction 


Compressive sensing is a scheme that was developed to reconstruct data that was 
down-sampled below the Shannon frequency. It accomplishes this by exploiting 
certain statistical properties assumed to exist in the true data. It has been used,
to great success, in many different applications ranging from signal processing and
medical imaging to scientific diagnostics [1-3]. Its main advantage lies in the
fact that it reduces the number of times a sampling device needs to be
used while still allowing the needed data to be reconstructed. This makes it an attractive
prospect for use in applications where the complete sampling of the data is either 
impossible or impractical, such as the problem of Global Positioning System (GPS) 
location and trajectory tracking. 

GPS tracking has a wide variety of applications across many different fields 
ranging from traffic management [4], to big data mobility analytics (e.g., origin- 
destination mapping) [5], and even to studying wildlife [6]. Additionally, many
Internet of Things (IoT) devices have functionalities that depend on location-based
tracking, such as health and fitness devices and trail tracking applications [7, 8].
In the vast majority of these cases, the data is collected on an embedded device
using a GPS receiver and stored for later processing. However, constant sampling
of the receiver can be very detrimental to the battery life of the device, which
poses a problem for portable devices, and it also produces an immense amount
of data requiring large storage capacities. Additionally, varying atmospheric and
environmental conditions can mean that some data points will not be sampled due 
to the high noise level seen by the receiver. 

To that end, there has been a considerable amount of research interest aimed 
at reconstructing trajectory data. There have been different approaches to this 
problem, each with its own strengths and limitations. One of them involved using
map matching and has shown high accuracy in reconstruction [9]. However, by 
definition, map matching requires access to road maps and as such would only 
work in locations where movement can only occur on well-defined and mapped 
paths which limits its applicability to vehicle mobility. Another approach used dead- 
reckoning to correct sparse GPS signals [10] in the context of wildlife tracking, 
i.e., with no predefined paths. However, dead-reckoning requires accelerometers 
or similar devices to gauge the movement being tracked. Again, this limits the 
applicability of these types of algorithms to devices which contain accurate mobility 
sensors. In order to avoid path constraints and sensor requirements, it is possible to 
use signal processing techniques to compress GPS trajectories in a way to retain 
the most relevant information allowing for efficient reconstruction when the data 
is needed while reducing the required space for data storage [11]. Although this 
technique has its advantages compared to the others, it does not solve the issue 
of power efficiency as the GPS receiver is still being constantly used. While this 
may not be an issue for applications such as vehicle tracking which are not power- 
sensitive, it poses a problem for portable and embedded applications. It also does 
not solve the issue of missing points due to environmental or weather issues. What 
would be optimal is a way to reduce the sampling of the GPS receiver and allow for 
random missing points. As such, this makes location tracking an ideal candidate for 
compressive sensing (CS) reconstruction. 

In this chapter, we show, for the first time, how the intrinsic sparsity properties 
of trajectory signals can be exploited to reduce the sampling acquisition, hence 
the GPS receiver usage, as much as possible while still allowing for accurate 
path reconstruction. Our compressive sensing-based approach allows us to treat the 
problem of path reconstruction as a constrained optimization problem that can be 
solved even on devices with low processing power. Furthermore, less data means 
less storage is required on the embedded device, or on a remote server, which could 
prove beneficial if the tracking is required over a long period of time or if a large
number of paths need to be saved in the context of big data mobility analytics.
The chapter is organized as follows: first, the signal model is discussed along with
the underlying properties and how they can be exploited. Then, the general operating 
principles of the novel compressive sensing approach to trajectory reconstruction 
are presented. Following that, two different approaches to CS are explored and their 
relative strengths and weaknesses are elaborated. Finally a discussion of future work 
concludes the chapter. 


2 Signal Representation 


2.1 Principles 


In most embedded devices, location data is output at a given time $t$ as a pair of
longitude and latitude coordinates which shall be denoted as $(\lambda(t), \phi(t))$. They also
provide data about the accuracy of a given coordinate pair in the form of a numerical 
estimate for the probable error evaluated in meters [12]. In other words, if one 
could imagine a circle centered at this coordinate pair with a radius equal to the 
estimated error, then the true location of the device will lie within this circle with a 
certain probability whose value depends on the manufacturer. Since GPS receivers 
rely on satellite communications, these errors can have multiple different sources. 
The physical environment in which the receiver operates, such as open-sky or urban 
conditions, significantly affects the quality of the received signal. A classic example 
would be the user going through a tunnel where there is little or no satellite signal. 
In addition to that, devices vary greatly in their processing techniques, complexity, 
and sophistication, which in turn vary the quality of the signals they produce. For 
these reasons, this chapter presents algorithms that have been designed and tested to 
be both efficient and robust against a wide range of error values. 

Due to the above-mentioned complex nature of their sources, real location 
errors do not necessarily follow a simple statistical distribution. However, for 
simplification, we assume that they follow a Gaussian distribution. In that case, the 
previously defined circle would generally represent an integer multiple of standard 
deviations. 

As with most numerical schemes, the user's movement is sampled at discrete
time intervals $t_n$. Their path can thus be represented as a $2 \times N$ array of pairs $(\lambda[n], \phi[n])$,
with each pair representing their location at a specific sampling point. This array
shall be referred to as the fully sampled path.

The goal of the protocol outlined in this chapter is to reduce the number of 
samples the device has to actually gather while still being able to reconstruct the 
full path to a given accuracy. The degree of downsampling is quantified by the 
sampling ratio (v) which is just the ratio of the number of samples collected to 
the total number of samples in the full path, referred to hereafter as the path length 
(N). 
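As a concrete reading of the sampling ratio, the sketch below keeps a fraction of the N samples of a fully sampled path and records which indices were kept. Choosing the kept indices uniformly at random is an assumption made here for illustration (the text above only fixes the ratio), and the function name `subsample` is hypothetical.

```python
import random

def subsample(path, ratio, seed=0):
    """Keep round(ratio * N) samples of the path, chosen at random.

    Returns the sorted kept indices and the corresponding samples.
    """
    n = len(path)
    m = max(1, round(ratio * n))
    kept = sorted(random.Random(seed).sample(range(n), m))
    return kept, [path[i] for i in kept]

# A dummy path of N = 100 coordinate pairs, sampled at a ratio of 0.2.
full_path = [(0.001 * i, 0.002 * i) for i in range(100)]
kept, samples = subsample(full_path, 0.2)
print(len(samples), len(samples) / len(full_path))
```

The index list is what a reconstruction step needs in order to know where in the full path each collected sample belongs.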
2.2 Model


In order to test the different CS algorithms that were designed and optimize them, 
we needed a significant amount of fully sampled GPS trajectory data to serve as the 
ground truth to which the results of the reconstruction can be compared. To that end, 
we opted to simulate the trajectory data that was used to gauge the performance of 
algorithms in the design stage. The advantage of using simulated data in this step 
lies in the flexibility offered by the ability to easily control the path lengths as well 
as the amount of noise added to the data points. After the design stage, the optimal 
algorithms were implemented on a portable device and made to run on real-life data
to evaluate their performance in real time. This section aims to elaborate on the model
that was used to generate the simulated data. 

A lot of research has been conducted around modelling human trajectories 
[13, 14]. Many different models have been proposed taking into account different 
parameters. Even though no conclusive results have been reached about the statisti- 
cal nature of human movements [14, 15], we are not interested in their distribution 
but in the individual structure of the trajectory itself. For this reason, we chose the 
random walk model (RWM) for our simulations. 

The random walk has played an integral role in the modelling and analysis of 
stochastic processes ever since Karl Pearson introduced the original random walk 
problem [16]. The random walk which we used to generate the trajectory data is 
the two-dimensional walk on the real plane $\mathbb{R}^2$ in discrete time, also known as the
Pearson random walk [17]. There, at each time step, the walker takes a step of a
fixed size in a completely random and isotropic direction. As such the walker's
$(x, y)$ coordinates in the plane at time $n + 1$ can be written in terms of their position
at time $n$ as:


$$\begin{aligned} x[n + 1] &= x[n] + \cos(\theta[n + 1]) \\ y[n + 1] &= y[n] + \sin(\theta[n + 1]) \end{aligned} \tag{1}$$


where the steps are taken to be of unit size and $\theta[n] \sim \mathcal{U}(0, 2\pi)$ is the heading
which is drawn from a uniform distribution. The statistical properties of the Pearson
random walk have been studied extensively ever since its introduction and it was
shown that at long times, the distance of the walker to the origin ($\rho = \sqrt{x^2 + y^2}$)
obeys the well-known Rayleigh distribution:


$$P(\rho) = \frac{2\rho}{n} e^{-\rho^2 / n} \tag{2}$$


After the trajectory path is generated, noise is overlaid on it by sampling from
a normal distribution $\mathcal{N}(0, \sigma_n^2)$ with a standard deviation that defines the noise
level ($\sigma_n$). This noise level may be constant for each data point on the path or it can
vary between points in order to simulate the effect of transient noise sources.
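The model of Eq. (1) with additive Gaussian noise can be sketched in a few lines of standard-library Python. The function names, the seeds and the choice of a constant noise level are assumptions made for this illustration.

```python
import math
import random

def pearson_walk(n_steps, seed=0):
    """Unit-step walk in the plane with uniformly random headings, as in Eq. (1)."""
    rng = random.Random(seed)
    x, y = [0.0], [0.0]
    for _ in range(n_steps):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x.append(x[-1] + math.cos(theta))
        y.append(y[-1] + math.sin(theta))
    return x, y

def add_noise(x, y, sigma, seed=1):
    """Overlay zero-mean Gaussian noise of standard deviation sigma on each point."""
    rng = random.Random(seed)
    return ([a + rng.gauss(0.0, sigma) for a in x],
            [b + rng.gauss(0.0, sigma) for b in y])

x, y = pearson_walk(500)
xn, yn = add_noise(x, y, sigma=0.3)
print(math.hypot(x[-1], y[-1]))  # distance of the clean walk from the origin
```

A point-dependent noise level, as mentioned above for transient noise sources, could be modelled by passing a list of standard deviations instead of a single value.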
Of course, due to its ellipsoidal shape, the surface of the earth is not planar.
In fact, as mentioned previously, GPS data comes in the form of ellipsoidal 
latitude/longitude coordinates. For our algorithm to be useful in real conditions, it 
needs to be able to perform the reconstruction on trajectory data in these coordinates. 
To that end, the path generated using the RWM on a 2D plane is mapped to the 
Lat/Lon coordinate system using the Pseudo-Mercator projection [18], which is 
based on the WGS84 reference ellipsoid, before being analyzed and fed into the 
algorithm. This projection was chosen because of its widespread use in mobile 
mapping applications. 
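The mapping between planar coordinates and latitude/longitude can be sketched with the usual spherical Pseudo-Mercator (Web Mercator) formulas, which use the WGS84 semi-major axis as the sphere radius. The function names below are illustrative, and how the simulated walk is scaled and positioned on the plane before conversion is not specified in the text, so the example point is an assumption.

```python
import math

R = 6378137.0  # WGS84 semi-major axis in metres

def mercator_to_latlon(x, y):
    """Planar Pseudo-Mercator coordinates (metres) to latitude/longitude (degrees)."""
    lon = math.degrees(x / R)
    lat = math.degrees(2.0 * math.atan(math.exp(y / R)) - math.pi / 2.0)
    return lat, lon

def latlon_to_mercator(lat, lon):
    """Latitude/longitude (degrees) to planar Pseudo-Mercator coordinates (metres)."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4.0 + math.radians(lat) / 2.0))
    return x, y

# Example: move a planar point 100 m east and 50 m north of an arbitrary anchor.
ax, ay = latlon_to_mercator(51.75, -1.25)
print(mercator_to_latlon(ax + 100.0, ay + 50.0))
```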


3 Compressive Sensing 


3.1 Basics 


Compressive sensing is the general name for a class of algorithms that allow for the 
subsampling of signals to levels below the Shannon frequency, while still being able 
to reconstruct the signal ex post facto with a high level of accuracy [19]. Its basic, 
most generic, structure can be seen in Fig. 1. To understand the principle behind 
the scheme, consider a one-dimensional (1D) vector $x \in \mathbb{R}^n$ of real numbers, and
let this be the signal that one wants to measure. If one considers a subsampling
measurement matrix $A \in \mathbb{R}^{m \times n}$ which maps the true signal to the measured
signal $y = Ax$, then the goal of compressive sensing is to recover $x$ from $y$ with
$m \ll n$. Now, of course, this problem is not generally solvable, as the mapping
between $x$ and $y$ is underdetermined and this leads to an infinite number of exact
solutions. However, the scheme gets around this by imposing certain properties on
the underlying true signal. Specifically, compressive sensing assumes that $x$ is sparse
in some basis $\{f_j\}_{j \in [1, n]}$. This means that our signal can be written as


$$x = \sum_{j=1}^{n} \alpha_j f_j = F\alpha \tag{3}$$


[Figure: block diagram. Signal sampling: the input signal x is passed through the sub-sampling matrix A to give the compressive measurements y. Generic recovery algorithm: the reconstruction matrix M and the sparsifying basis F yield a sparse approximation and the output signal.]


Fig. 1 Diagram showing the general structure of the CS scheme to reconstruct sub-sampled paths
where $F \in \mathbb{R}^{n \times n}$ is called the sparsifying matrix and $\alpha_j$ are the sparse components
of the vector in the new basis. The problem can then be modeled as an $\ell_0$-minimization
subject to a constraint:


$$\hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_0 \quad \text{with} \quad M\alpha = y \tag{4}$$


where $\hat{x} = F\hat{\alpha}$ is the reconstructed signal and $M = AF$ is conventionally called
the reconstruction matrix. The $\ell_0$ norm is commonly defined as the number of non-zero
components of a vector, which means that it quantifies the true sparsity of the
vector. Under this paradigm, path reconstruction can be treated as a
multidimensional constrained optimization problem.


3.2 Sparsity 


Sparsity of $\alpha$ means that most of its components have a value of zero. For example,
the Fourier domain representation of a perfect sine wave oscillating at a given 
frequency is a delta function at that frequency and zero elsewhere. Another more 
relevant example would be the discrete wavelet transform used in JPEG 2000 
compression. The assumption behind this compression scheme is that most images 
are sparse in the wavelet domain [20]. This means that, in this domain, one needs 
to only store the non-trivial components of the image and can discard most of the 
others thus greatly reducing its size. This argument is one of the motivations for 
using CS in imaging techniques [21]. 

The first obstacle in implementing CS as described is that Eq. (4) is an NP-hard
problem [22]. The $\ell_0$-norm, although it models the concept of sparsity perfectly, is
poorly adapted for real-world use for several reasons. For example, realistic
signals are rarely exactly sparse, in the sense that they rarely have mostly
zero-valued components in some basis. In most applications of CS, one instead considers
pseudo-sparsity, where in some basis most of the components are much smaller than
the few major components of the signal. The former can then be safely ignored
while ensuring that most of the important information in the signal is captured, so
that it can be accurately reconstructed. Furthermore, the $\ell_0$-norm is a pseudo-norm
that is non-differentiable. Last but not least, such solutions are very sensitive to noise
and, in real-world scenarios, signals are always noisy.

In order to circumvent these obstacles, one option is to use a convex
relaxation of the optimization problem. This can be achieved by using the $\ell_1$-norm
instead. Hence, Eq. (4) can be rewritten as:


$$\hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_1 \quad \text{with} \quad M\alpha = y \qquad (5)$$

where

$$\|\alpha\|_1 = \sum_{i=1}^{n} |\alpha_i| \qquad (6)$$


For this relaxation to yield a perfect recovery, A and F must
be as mutually incoherent as possible [19]. In other words, the lower the
maximum correlation between any two elements of the matrices, the fewer
measurements are required for a perfect reconstruction.
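
The difference between exact sparsity and pseudo-sparsity can be made concrete with a few lines of Python. This is only an illustrative sketch: the coefficient vector and the tolerance of 0.1 are hypothetical values chosen for the example, not quantities taken from the chapter.

```python
def l0_norm(alpha, tol=0.0):
    """Count components whose magnitude exceeds tol (tol=0 gives the exact l0 'norm')."""
    return sum(1 for a in alpha if abs(a) > tol)


def l1_norm(alpha):
    """Sum of absolute values, i.e. the convex surrogate used in the relaxed problem."""
    return sum(abs(a) for a in alpha)


# Hypothetical pseudo-sparse coefficient vector: two dominant entries plus small ones.
alpha = [5.0, 0.02, -0.01, 3.0, 0.0, 0.03]

exact_sparsity = l0_norm(alpha)            # every non-zero entry counts
pseudo_sparsity = l0_norm(alpha, tol=0.1)  # entries below the assumed tolerance are ignored
```

The exact count treats the tiny entries as significant, which is why a noisy signal looks dense under the l0 measure even when only a few components carry information.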


3.3 Discrete Cosine Transform 


Location data, even randomly generated, can safely be assumed to contain
mostly low-frequency oscillations. This intuition was behind the
choice of the discrete cosine transform (DCT) as the sparsifying basis for our signal.
To recall, the DCT of a 1D vector $x \in \mathbb{R}^N$ is defined as:


$$\alpha_n = 2 \sum_{k=0}^{N-1} x_k \cos\left(\frac{\pi n (2k + 1)}{2N}\right) \qquad (7)$$


This equation can be conveniently rewritten as $\alpha = F_{DCT}\,x$, with $F_{DCT}$ being the
matrix representation of the discrete cosine transform. If the location data is treated
as two separate 1D vectors representing latitudes and longitudes, the DCT of these
vectors can be shown to be pseudo-sparse. To verify this, we ran simulations
generating a large number of random walks of different sizes. Figure 2 compares
the largest components of the transformed path to the rest and shows
clearly that the $(\lambda[n], \phi[n])$ data is indeed pseudo-sparse in this basis. Another
interesting feature of the results in this figure is that, as the noise added to
the path increases, the apparent sparsity of the signal decreases. This means that by
biasing the reconstruction toward sparser paths, it is possible to denoise the acquired
signal and even, in some cases, recover a path that is more accurate than what could
have been obtained by sampling it completely. This is elaborated further
in the following sections.
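
As a rough illustration, the transform of Eq. (7) and a simple pseudo-sparsity measure can be written directly in Python. This is a minimal sketch: the direct O(N^2) summation, the helper name `fraction_below`, and the synthetic track are assumptions made for the example and are not the simulation code used for Fig. 2.

```python
import math


def dct(x):
    """Unnormalized DCT of Eq. (7): alpha_n = 2 * sum_k x_k cos(pi n (2k + 1) / (2N))."""
    n_samples = len(x)
    return [
        2.0 * sum(
            x[k] * math.cos(math.pi * n * (2 * k + 1) / (2 * n_samples))
            for k in range(n_samples)
        )
        for n in range(n_samples)
    ]


def fraction_below(alpha, rel_threshold):
    """Fraction of components smaller than rel_threshold times the largest magnitude."""
    peak = max(abs(a) for a in alpha)
    return sum(1 for a in alpha if abs(a) < rel_threshold * peak) / len(alpha)


# Hypothetical smooth 1D coordinate track: a slow drift plus one slow oscillation.
track = [0.01 * k + 0.5 * math.sin(2 * math.pi * k / 64) for k in range(64)]
coeffs = dct(track)
share_small = fraction_below(coeffs, 0.05)  # share of coefficients under 5% of the peak
```

A production implementation would normally use a fast transform from a numerical library; the direct sum is used here only because it mirrors the equation term by term.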


3.4 Measurement Matrix 


The final step before solving the CS problem is determining the appropriate
measurement, or subsampling, matrix. Random matrices drawn from i.i.d. Bernoulli
or Gaussian distributions have been shown to be incoherent with any other basis
[23]. As a starting point, let us consider a sampling scheme based on the Bernoulli
Fig. 2 Plot showing the sparsity analysis of the DCT of a random walk. Each curve plots the
percentage of components whose absolute value lies below a given threshold relative to the
dominant components in the signal. For example, the blue curve shows that, with no noise, almost
90% of the cosine components of the path are less than 5% of the largest component


distribution where at each time step we generate a pseudo-random number that is 
used to determine whether or not a data point is sampled based on the probability 
defined as: 


$$p[n] = v \qquad (8)$$


where, again, v is a predetermined sampling ratio. This gives the draw outcome,
$b[n] \in \{0, 1\}$, which decides whether or not to request a new sample. The
draw outcomes can then be used to generate the subsampling matrix A, which
together with the DCT sparsifying matrix forms the reconstruction matrix M. It
should be noted that most CS applications have problems implementing random
matrices due to the difficulty of storing and reproducing them at the receiver.
Luckily, location data signals are time based, hence the measurement matrix is
directly mapped to the request from the GPS receiver for a new sample acquisition.
Unfortunately, in its current form, the subsampling matrix is completely agnostic
to $\sigma_\varepsilon$, i.e., it treats all data points as being equal. This is, of course, not the case
in a real-world application, where $\sigma_\varepsilon$ varies along the path. Also, because the
sampling is based on random number generation, the effective sampling
ratio is almost never the same as the predetermined v. This is detrimental to some
applications that require a fixed known sampling ratio. These shortcomings will be
addressed in the following sections using the so-called adaptive sampling schemes.
The complete process for generating the subsampling matrix is illustrated in Fig. 3.
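
The Bernoulli draw and the resulting subsampling matrix can be sketched as follows. The function names are hypothetical, and the construction of A as the identity rows of the sampled time steps is an assumption about how the draw outcomes are turned into a matrix; it is one natural reading of the text, not a confirmed implementation detail.

```python
import random


def bernoulli_mask(n_samples, v, rng):
    """Draw b[n] in {0, 1} with P(b[n] = 1) = v, as in Eq. (8)."""
    return [1 if rng.random() < v else 0 for _ in range(n_samples)]


def subsampling_matrix(mask):
    """Build A by keeping the identity rows of the sampled time steps (assumed layout)."""
    n_samples = len(mask)
    return [
        [1 if col == row else 0 for col in range(n_samples)]
        for row, keep in enumerate(mask)
        if keep
    ]


rng = random.Random(0)  # fixed seed, only for reproducibility of the example
mask = bernoulli_mask(256, 0.2, rng)
A = subsampling_matrix(mask)
effective_ratio = sum(mask) / len(mask)  # generally differs from the requested v
```

Because each draw is independent, `effective_ratio` fluctuates around v from block to block, which is the variance issue discussed above.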
Fig. 3 Diagram showing the basic process of generating the subsampling matrix and using to
Fig. 4 Diagram showing the block segmentation and overlap scheme


3.5 Block-Wise Reconstruction 


Subsampling L out of the entire path length N and then applying any reconstruction
algorithm is practically infeasible. This is because N can become very
large, and therefore L as well, thereby increasing the complexity of the algorithm
beyond the capabilities of any embedded device. To circumvent this
problem, we adopt a block-wise reconstruction method as shown in Fig. 4. A time-wise
sliding window is used to form each block, and the CS scheme is applied to it.
The figure also illustrates an overlap-cut-save (OCS) method that
manages the block boundaries by splitting each block into a buffer zone and a
sub-block in order to mitigate any discontinuities and interference effects caused by
the basis transform. In other words, reconstructed samples in each sub-block do not
suffer from missing information, which is obtained from past and future samples
outside the sub-block. This method effectively means that the CS protocol is
only applied to the block length B. This parameter can be optimized independently
of the path length N to obtain the best possible performance with the least complexity.
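
The chapter does not spell out the index arithmetic of the OCS scheme, so the following is only a sketch under stated assumptions: each block has a buffer zone of `buffer_len` samples on both sides, consecutive blocks advance by `block_len - 2 * buffer_len` samples, and only the central sub-block of each reconstruction is kept (the first and last blocks also keep their outer edge). The function name `ocs_blocks` is hypothetical.

```python
def ocs_blocks(n_samples, block_len, buffer_len):
    """Return (block_start, block_end, keep_start, keep_end) index tuples.

    Each block [block_start, block_end) is reconstructed as a whole, but only
    the samples in [keep_start, keep_end) are saved to the output path.
    """
    step = block_len - 2 * buffer_len
    if step <= 0:
        raise ValueError("the buffer zones must leave a non-empty sub-block")
    blocks = []
    start = 0
    while True:
        end = min(start + block_len, n_samples)
        is_last = end == n_samples
        keep_start = start if start == 0 else start + buffer_len
        keep_end = end if is_last else end - buffer_len
        blocks.append((start, end, keep_start, keep_end))
        if is_last:
            break
        start += step
    return blocks


# Values matching the simulation settings quoted for Fig. 7.
blocks = ocs_blocks(20480, 256, 32)
```

With this layout the kept sub-blocks tile the path without gaps or overlaps, so the block length can be tuned independently of N.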


4 Lasso 


4.1 Algorithm 


Equation (5) is closely related to the least absolute shrinkage and selection operator 
(LASSO) regression problem in statistics [24], as both of their Lagrangian forms 
can be written as: 


$$\hat{\alpha} = \arg\min_{\alpha} \left\{ \frac{1}{2B} \|y - M\alpha\|_2^2 + \mu \|\alpha\|_1 \right\} \qquad (9)$$


where y is the measured down-sampled block, M is the reconstruction matrix, B
is the total number of samples in the full block, $\hat{\alpha}$ is the reconstructed block in the
DCT domain, and $\mu$ is conventionally called the learning rate or the regularization
parameter. This formulation can be understood as trying to find the vector “closest”
to y, in the sense of the $\ell_2$-norm, that has the smallest $\ell_1$-norm. As previously
discussed, the $\ell_1$-norm here ensures that we bias the regression toward
sparser vectors, with the degree of bias being controlled by $\mu$. Expressing the
problem as Eq. (9) turns it into a linear regression problem and allows the use of
traditional LASSO regression solvers. For this case, we have used the conventional
Coordinate Descent algorithm, which is known to be quite effective for the LASSO
problem and easily implementable across various platforms [25]. The mathematical
details of this algorithm are beyond the scope of this chapter but can be found in the
literature [24].
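
For readers who want a feel for how coordinate descent handles the l1 penalty, the sketch below implements the textbook cyclic update with soft-thresholding in plain Python. It is not the implementation used in this chapter: the objective is assumed to have the form (1 / (2 * scale)) * ||y - M a||^2 + mu * ||a||_1, where `scale` is a normalization constant supplied by the caller, and the fixed number of sweeps replaces a proper convergence test.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal operator of the l1 term)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(M, y, mu, scale, n_sweeps=100):
    """Minimize (1 / (2 * scale)) * ||y - M a||_2^2 + mu * ||a||_1 over a."""
    n_rows, n_cols = len(M), len(M[0])
    alpha = [0.0] * n_cols
    residual = list(y)  # equals y - M a for the initial a = 0
    col_sq = [sum(M[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(n_sweeps):
        for j in range(n_cols):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot reduce the residual
            # Correlation of column j with the residual that excludes coordinate j.
            rho = sum(M[i][j] * residual[i] for i in range(n_rows)) + col_sq[j] * alpha[j]
            new_value = soft_threshold(rho, scale * mu) / col_sq[j]
            delta = new_value - alpha[j]
            if delta != 0.0:
                for i in range(n_rows):
                    residual[i] -= M[i][j] * delta
                alpha[j] = new_value
    return alpha


# Toy check with an identity matrix: the solution is the soft-thresholded input.
toy = lasso_coordinate_descent(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [3.0, -0.5, 1.0], mu=1.0, scale=1.0
)
```

Each coordinate update has a closed form, which is why the method needs no step size and is straightforward to port to embedded platforms.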

Reconstructing location data using this scheme has many advantages. The 
algorithm is implemented in a way that is completely agnostic to the size of the 
input which means that the block length (B) and the sampling ratio (v) can be 
changed at whim. It also means that the variance of the length of the down-sampled 
path, inherent to the Bernoulli-based measurement matrix mentioned above, is not 
detrimental to its operation. Furthermore, the hyper-parameters such as the learning 
rate can be readjusted if necessary after each block. On a more practical note, due 
to the ubiquity of the Coordinate Descent algorithm, there are many libraries that 
facilitate its implementation. 


4.2 Parameter Optimization


This reconstruction scheme incorporates many hyper-parameters; the most relevant
ones are the block length B, the sampling ratio v, and the regularization parameter
$\mu$, all of which can greatly affect the performance and complexity of the algorithm.
In order to find the optimal set of parameters, we performed an extensive
scan over a wide range of values using simulated data with a wide range of noise
levels. The metric used to evaluate the performance of the algorithm was
the conventional root-mean-square error (RMSE), defined as:


$$\mathrm{RMSE}(\hat{r}, r) = \sqrt{\frac{1}{N} \sum_{i=0}^{N-1} \left( (\hat{x}_i - x_i)^2 + (\hat{y}_i - y_i)^2 \right)} \qquad (10)$$


where N is the length of the full unsegmented path, r is the full noise-free
path expressed in the $(x_i, y_i)$ Web Mercator coordinates, and all variables with a
circumflex represent reconstructed quantities. When it comes to finding the optimal
block length, there is a trade-off to be made between complexity and performance.
It should be noted that, in evaluating the performance of the algorithm, we consider
the deviation of the full reconstructed path and not just of the individual blocks.
Intuitively, the higher the sampling ratio, the better the performance, since more
samples are included and the CS scheme becomes just a denoising operation.
However, this is not desired, as the main application is sub-sampled reconstruction;
the goal is therefore to find the minimum possible sampling ratio that still allows
a reasonable reconstruction. Clearly, the shorter the block length, the faster each
block-wise iteration is. Additionally, less storage is required and any inverse
transformation is less complex. However, the block length affects the
pseudo-sparsity of $\alpha$ and thus the CS reconstruction.
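
The RMSE metric can be computed in a few lines. The sketch below assumes both paths are given as equal-length lists of (x, y) tuples in the same projected coordinates; the function name and data layout are choices made for this example.

```python
import math


def rmse(reconstructed, reference):
    """Root-mean-square distance between two paths given as lists of (x, y) tuples."""
    if len(reconstructed) != len(reference):
        raise ValueError("both paths must contain the same number of points")
    total = sum(
        (x_hat - x) ** 2 + (y_hat - y) ** 2
        for (x_hat, y_hat), (x, y) in zip(reconstructed, reference)
    )
    return math.sqrt(total / len(reference))


# Two-point toy path: one point is exact, the other is off by a 3-4-5 triangle.
error = rmse([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```

Note that the error is evaluated on the full recombined path, so any discontinuity introduced at block boundaries is included in the score.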

The scatter plot in Fig. 5 shows the RMSE across different sampling ratios and
block lengths at a specific added noise level $\sigma_\varepsilon$ and regularization parameter $\mu$.
One can see that the best achievable performance for the lowest sampling ratio is
obtained at $B \sim O(10^2)$ with v ~ 0.2. Since the DCT is most efficient with a block length that
is a power of 2, we choose B = 256.


4.3 Adaptive Sampling 


In order to improve the previously mentioned Bernoulli sampling scheme while
keeping its random nature, a heuristic algorithm, shown in Algorithm 1,
was developed to recursively adjust the drawing probability in order to reduce the
noise injected into the reconstruction algorithm. We define a relative error
$\varepsilon_r = \min(\sigma_\varepsilon/\sigma_{max}, 1)$, where $\sigma_{max}$ is a predefined error hyperparameter that can vary
Fig. 5 Scatter plot of RMSE for different block lengths and sampling ratios at $\sigma_\varepsilon$ = 20 m and
$\mu$ = 0.01


depending on the application. $\varepsilon_r$ is then compared to an error threshold $\beta$ in order
to decide on the incoming signal quality.

The main assumption behind this noise adaptive sampling is that if a certain data
point was sampled with a high noise level, upcoming data points are
likely to be noisy as well. This is because sources of significant noise, e.g., long
tunnels or weather disturbances, usually evolve on a slower timescale than human motion.
As such, the algorithm reduces the probability of sampling the next point based
on how high the noise was. To account for the fact that the source may be temporary,
the probability then grows until it regains its original value v, as long as no points
have been sampled since. The algorithm does the opposite in cases where
the noise is unusually low. This ensures that the sampling protocol favors points with
a lower error while maintaining a stochastic nature.
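
The logic of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: `rand(0, 1, p)` is interpreted as a draw that returns 1 with probability p, the per-sample relative errors are passed in as a list, and the variable names are chosen for this example.

```python
import random


def noise_adaptive_sampling(errors, v, delta, beta, rng):
    """Return the draw outcomes b[n] for a path, following Algorithm 1.

    errors[n] is the relative error in [0, 1] reported for sample n; it is
    only read for time steps that were actually sampled.
    """
    if not errors:
        return []
    gamma = 0.0       # memory term, reset to 1 after every acquired sample
    last_error = 0.0  # relative error of the most recently acquired sample
    b = [1 if rng.random() < v else 0]
    for n in range(1, len(errors)):
        if b[n - 1] == 1:
            last_error = errors[n - 1]
            gamma = 1.0
        else:
            gamma *= delta  # forget the last observation progressively
        if last_error > beta:
            p = (1.0 - last_error * gamma) * v  # noisy: sample less often
        else:
            p = v + (1.0 - v) * (1.0 - last_error) * gamma  # clean: sample more often
        b.append(1 if rng.random() < p else 0)
    return b


rng = random.Random(1)  # fixed seed, only for reproducibility of the example
noisy = noise_adaptive_sampling([1.0] * 200, v=0.5, delta=0.5, beta=0.5, rng=rng)
clean = noise_adaptive_sampling([0.0] * 50, v=1.0, delta=0.5, beta=0.5, rng=rng)
```

In the fully noisy case the probability drops to zero right after a sample and then relaxes back toward v, so two consecutive samples never occur; in the noise-free case a sample pushes the next probability up to one.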


5 Neural Networks 


5.1 Introduction 


As the previous section demonstrates an iterative algorithm, which can prove to be 
quite computationally taxing on some embedded systems, we turn to an algorithm 
based on neural networks. Neural networks are a class of computing paradigms that 
are loosely modeled after the operation of the animal brain in hopes of emulating its 
processing ability. They have found tremendous success in a multitude of machine 
learning problems including, but not limited to, computer vision, voice recognition, 

Algorithm 1 Noise Adaptive Sampling Algorithm

Input:  v ∈ [0, 1]: Sampling ratio
        δ ∈ (0, 1]: Forgetting factor
        ε_r[n] ∈ [0, 1]: Received error for the sample
        β ∈ (0, 1]: Error threshold
        N ∈ (0, ∞[: Path length
Output: b[n] ∈ {0, 1}: Whether to request a data sample

begin
    γ[0] ← 0
    b[0] ← rand(0, 1, v)
    ε_s ← 0
    n ← 1
    while n < N do
        if b[n − 1] = 1 then
            ε_s ← ε_r[n − 1]
            γ[n] ← 1
        else
            γ[n] ← δ γ[n − 1]
        end
        if ε_s > β then
            p[n] = (1 − ε_s γ[n]) v
        else
            p[n] = v + (1 − v)(1 − ε_s) γ[n]
        end
        b[n] ← rand(0, 1, p[n])
        n ← n + 1
    end
end


reinforcement learning, and artificial intelligence [26]. Generally, the network is 
composed of a collection of nodes, called neurons, grouped in layers where some, 
or all, of the neurons of one layer are connected to the neurons in the following layer 
via weights that determine how the output of the former affects the input of the latter. 
A much more detailed overview of neural networks, their operating principles, and
their applications can be found in [27, 28].

To tackle this specific CS application, we have framed the problem of recon- 
structing location data as a supervised learning one, specifically supervised regres- 
sion. This means that the goal of the network would be to find the optimal set 
of weights which would reconstruct the full block from the down-sampled one. 
Although neural networks are commonly used in classification applications [29], 
they have also proven to be effective in some regression problems [30, 31]. 

The principal challenge of this scheme was to find the network
design that achieves the best reconstruction. The input and output layers are, of course,
fixed by the length of the down-sampled and full block, respectively. Other than that,
the width (size of layers) and depth (number of layers) of the network, as well as the
number of dropout layers (layers that randomly deactivate a certain proportion of
neurons) and the type of activation layers (which add non-linearities to the output
of layers), needed to be optimized.

There are two main advantages to the neural network scheme. For one, it does
not require us to specify a sparsifying matrix or a measurement matrix. The only 
input to the protocol is the down-sampled block. Additionally, although neural 
networks can take a long time to train depending on the size of the network and 
the amount of training data, they are quick to run and they can be very efficiently 
parallelized since computing the output of a neural network basically amounts to 
matrix multiplication. This makes them quite ideal for embedded systems, as the 
training can be done remotely on dedicated machines and the weights can then be 
transferred to the device where they are used to achieve the reconstruction. 


5.2 Network Design and Optimization 


As mentioned above, neural networks contain many hyper-parameters that need to
be optimized. Other than the parameters controlling the main architecture of the
network, one also needs to choose the appropriate training algorithm and activation
layers (which add non-linearities to what would otherwise be a purely linear
operation) and to optimize any dropout regularization (where certain random neurons
are deactivated during training to prevent overfitting).

The simulated location data was split into training data which was used to train 
a specific set of parameters, and validation data which was used to evaluate their 
performance based on the previously mentioned RMSE metric. 

The optimal network was found to contain 10 hidden layers of mostly alternating
sizes, shown in Fig. 6. The bottleneck layers serve the same purpose as in
autoencoder networks, where they reduce the dimensionality of the input data, and
as such are quite useful for CS purposes [32]. The sizes of the input and output
layers are of course fixed by the chosen block length B and sampling ratio v. Again,
we have chosen B = 256 with v ~ 0.2, which reduces the size of the network
and thus its runtime during training and inference. The chosen activation layer
was the leaky ReLU function [33], which allows for both negative and positive
outputs; this is, of course, essential for this particular regression problem. The
leak parameter $\alpha \in\, ]0, 1[$ also needed to be optimized to find the best possible
performance. As for the convergence scheme, tests have shown that the best
performance/efficiency trade-off is obtained by using the Adam optimization algorithm [34]
with $\alpha = 0.8$. It should be noted that only one neural net is used for both latitude
and longitude reconstruction.
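
To illustrate why inference is cheap, the sketch below runs a forward pass through a tiny fully connected network with leaky ReLU activations. The layer sizes and weights are hypothetical toy values, and leaving the last layer linear is an assumption made for this example; the actual architecture is the one shown in Fig. 6.

```python
def leaky_relu(z, leak):
    """Leaky ReLU: identity for z >= 0, slope `leak` for z < 0."""
    return z if z >= 0.0 else leak * z


def dense(inputs, weights, biases):
    """One fully connected layer; weights[i][j] maps input j to output i."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + bias
        for row, bias in zip(weights, biases)
    ]


def forward(inputs, layers, leak):
    """Apply the layers in order, with a leaky ReLU after all but the last one."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:
            activations = [leaky_relu(z, leak) for z in activations]
    return activations


# Hypothetical toy network: 2 inputs -> 2 hidden units -> 3 outputs.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),
]
output = forward([1.0, 2.0], layers, leak=0.8)
```

The whole computation is a chain of matrix-vector products and element-wise operations, which is what makes it easy to parallelize on embedded hardware.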


5.3 Adaptive Sampling 


The advantages of neural networks come at a cost. Once a network design is chosen 
and trained, it cannot be altered without retraining it. This means that the size of the 
Fig. 6 The optimal neural network architecture design showing the alternating hidden layers and
their activation functions. 10% dropout layers were used during training but are not shown
in this diagram


down-sampled block is fixed, as is the full block length. In other words, unlike the
LASSO scheme, v and B need to be predetermined and fixed.

This rigidity means that, not only can one not use the noise adaptive sampling 
algorithm, but one also cannot even use the simple Bernoulli algorithm as the 
inherent variance in the random number generation process means that the effective 
sampling ratio is not deterministic. 


Algorithm 2 Ratio Adaptive Sampling Algorithm

Input:  v ∈ (0, 1]: Sampling ratio
        b ∈ {0, 1}: Draw outcome
        B ∈ (0, ∞[: Block length
Output: b[n] ∈ {0, 1}: Whether to request a data sample

begin
    b[0] ← rand(0, 1, v)
    Σ ← 0
    n ← 1
    while n < B do
        Σ ← Σ + b[n − 1]
        p[n] ← (v − Σ/B)(B/(B − n))
        b[n] ← rand(0, 1, p[n])
        n ← n + 1
    end
end

In order to circumvent this issue, we have developed a ratio adaptive sampling
algorithm, shown in Algorithm 2, which ensures that the effective sampling ratio
is the same as the requested one. The number of points sampled in a block ($\Sigma$)
is continuously monitored and used to weight the probability of sampling the next
point so as to ensure that the final sampling ratio (v) is completely deterministic,
regardless of the random number generation algorithm.
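
A possible Python rendering of Algorithm 2 is given below. It is a sketch under two assumptions: `rand(0, 1, p)` returns 1 with probability p (so p >= 1 always samples and p <= 0 never does), and the probability is written as the single fraction (vB - Σ)/(B - n), which is algebraically the same as the product in the pseudocode but avoids rounding issues when the two factors should cancel exactly.

```python
import random


def ratio_adaptive_sampling(block_len, v, rng):
    """Return b[0..B-1] with a sample count pinned to the requested ratio v."""
    b = [1 if rng.random() < v else 0]
    total = 0  # running count of sampled points (Sigma in Algorithm 2)
    for n in range(1, block_len):
        total += b[n - 1]
        # Samples still needed divided by the number of draws still available.
        p = (v * block_len - total) / (block_len - n)
        b.append(1 if rng.random() < p else 0)
    return b


# With v * B an integer, every seed yields exactly v * B samples.
counts = [sum(ratio_adaptive_sampling(256, 0.25, random.Random(seed))) for seed in range(5)]
```

The positions of the samples remain random, but the count is forced: once the remaining draws equal the remaining quota every point is taken, and once the quota is met no further point is requested. When v * B is not an integer the count ends within one sample of it.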


6 Numerical Results 


6.1 Simulation Results 


In order to test the aforementioned CS schemes and gather statistics on their
absolute and relative performance, we have generated a large amount of simulated
location data using the random walk model (RWM) detailed in Section II.b. The
noisy simulated paths are each split into blocks, then down-sampled using one
of the above-mentioned sampling algorithms and fed to the different
reconstruction algorithms. The blocks are then recombined using the OCS protocol
to re-form the initial simulated paths. The performance of the algorithm is gauged
using the RMSE metric, which calculates “how far” the reconstruction is from the
true noise-free data. The RMSE is then averaged over all the different data points to
show the average performance of each algorithm.

To have a baseline performance to compare to, the RMSE is also computed for 
the full noisy path with respect to the noise-free data. This baseline estimates the 
level of accuracy attainable if one were to completely sample the path using a 
noisy detector. Figure 7 shows this baseline across different noise levels as well 
as the RMSE of the reconstruction achieved by both schemes (LASSO and neural 
networks (NN)) from a down-sampled path. Both algorithms are able to outperform 
the baseline across a wide range of noise levels except for the case of unrealistically 
low noise (<3 m). This is as expected since, as was shown earlier, true location data 
is sparse in the DCT domain with the sparsity decreasing as the added noise level 
increases. This means that the CS will filter the noise when reconstructing a path 
from down-sampled noisy data as it tries to find the most sparse solution. 

It should be noted that the results in Fig. 7 were obtained using the
regular Bernoulli sampling algorithm for the LASSO scheme and the ratio adaptive
sampling algorithm for the NN. To show the advantages of using the noise adaptive
sampling algorithm with the LASSO reconstruction, the noise profile added to the
path needed to be modified to emulate the effect of temporary high-noise sources.
To do so, instead of adding noise sampled from a normal distribution with a
constant $\sigma_\varepsilon$, each block is split randomly into random-sized “chunks”, where each
“chunk” has a different noise level that can range from very high to very low.
Although the noise adaptive sampling has a set target sampling ratio v (as can be
seen in Algorithm 1), the dependence on the noise level means that the sampling
Fig. 7 Plot showing the relative performance of both CS schemes compared to noisy complete
sampling across a wide range of noise levels with v ~ 0.2, N = 20480, B = 256 with a buffer
zone of 32 samples for the OCS scheme


ratio can vary significantly. Therefore, in order to accurately gauge the gain in
performance attained by using the noise adaptive sampling, we have compared it to
the regular Bernoulli sampling with multiple different sampling ratios. A simulation
was run with adaptive sampling parameters $\delta = 0.1$, $\beta = 0.5$, and $\sigma_{max} = 50$ m, and with
noise chunks ranging from 0 m to 100 m. The adaptive sampling improved
the reconstruction performance by about 50%
compared to a purely random sampling with the same target sampling ratio
(v = 0.2), and by about 25% compared to a random sampling with the same
effective sampling ratio (v ~ 0.4).
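
The chunked noise profile described above could be generated along the following lines. The details are assumptions made for illustration: the number of chunks is drawn uniformly up to a hypothetical `max_chunks`, the cut points are uniform over the block, and each chunk's noise level is drawn uniformly between `sigma_min` and `sigma_max`. The chapter does not specify these distributions.

```python
import random


def chunked_noise_levels(block_len, sigma_min, sigma_max, max_chunks, rng):
    """Assign one noise level per sample, constant within randomly sized chunks."""
    n_chunks = rng.randint(1, min(max_chunks, block_len))
    # Distinct interior cut points split the block into n_chunks non-empty chunks.
    cuts = sorted(rng.sample(range(1, block_len), n_chunks - 1))
    bounds = [0] + cuts + [block_len]
    levels = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        sigma = rng.uniform(sigma_min, sigma_max)
        levels.extend([sigma] * (end - start))
    return levels


def add_noise(track, levels, rng):
    """Add zero-mean Gaussian noise with a per-sample standard deviation."""
    return [x + rng.gauss(0.0, sigma) for x, sigma in zip(track, levels)]


rng = random.Random(0)  # fixed seed, only for reproducibility of the example
levels = chunked_noise_levels(256, 0.0, 100.0, 8, rng)
noisy_track = add_noise([0.0] * 256, levels, rng)
```

Dividing each level by the chosen maximum error and clipping at one would then give the relative errors consumed by the noise adaptive sampler.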


6.2 Real Data Experiment 


A mobile application has been developed for iOS devices as a proof of concept
for the applicability of the algorithm in real-world scenarios. The Core ML and
Accelerate frameworks from Apple Inc. were used to efficiently implement LASSO
and NN. As before, the block length remains B = 256, the learning rate $\mu$ = 0.01,
and v = 0.2. The Bernoulli sampling and the ratio adaptive sampling routines were
used for the LASSO and NN reconstruction, respectively. Figure 8 shows the live
reconstruction achieved by the CS algorithms on a path made up of N = 3584
sample points. Both the full path and the down-sampled points were recorded to
allow for the evaluation of the performance of the algorithms ex post facto. As can
be clearly seen in the figure, both algorithms managed to reconstruct the full path
Fig. 8 Reconstruction of a down-sampled path using the LASSO (in red) and NN (in green)
compared to the noisy fully sampled path (in blue). The latter represents a real walk around Zurich,
Switzerland, made up of 3584 data points sampled each second


to a great degree of accuracy. The estimated RMSEs for the reconstruction were
(10 ± 8) m and (28 ± 8) m for the LASSO and NN, respectively. It should be noted
that the uncertainty in the RMSE comes from the fact that we do not have access
to the true noise-free positions and can only compare the reconstruction to what is
inherently a noisy fully sampled path.


7 Future Work 


All the algorithms demonstrated thus far in this chapter treat the latitude and
longitude coordinates as independent from each other. One possible improvement that
could enhance the performance would be to consider a joint reconstruction scheme
that uses the correlation between $\lambda[n]$ and $\phi[n]$ to add additional constraints to the
CS schemes. However, we expect such a solution to be much more computationally
expensive due to the higher dimensional space. Another improvement would be
to set up a method to acquire the true noise-free path while performing the live
reconstruction in order to be able to measure the true RMSE even in cases of
high noise. This can be done by using multiple industrial-grade GPS receivers
simultaneously, or by manually recording the coordinates of each sampling
point, which is a more accurate (and tedious) method. The scheme can also be
improved by considering the effects of elevation. In this chapter,
we have only considered 2D trajectory data. However, there are many regions where,
due to the nature of the terrain, the elevation of the tracked object can change
significantly over the course of its trajectory. The advantage of our approach is that,
since each coordinate is treated independently, extending the treatment to include
elevation is a relatively simple task.


8 Conclusion 


In this chapter, we tackled the problem of path reconstruction for IoT devices. For
the first time, we have exploited the sparsity of location data in a particular domain
to reformulate the problem as that of regularized regression. Using that, we have
designed and implemented two novel solutions to trajectory reconstruction: LASSO
reconstruction and a neural network model. Each has its own set of advantages that makes
it suited to particular applications; for example, the neural network approach
is computationally very light, which makes it well suited to embedded systems. An
OCS algorithm has been integrated in each scheme to limit the reconstruction to a
parametric block length for a more efficient and feasible implementation. For each
protocol, an adaptive sampling technique has been developed to efficiently
down-sample the path and improve the performance of the algorithms. Both schemes
have been shown to successfully reconstruct the original path with ~20% of the
samples. Furthermore, they outperform the simple scheme of fully sampling the path
due to their denoising capability. Finally, a mobile application has been developed
as a proof of concept and to validate the theoretical assumptions that were made and
the applicability of the reconstruction algorithm in real-world scenarios.


1 Introduction


We consider a constrained optimization problem as follows: 


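$$\begin{aligned}
\min\ & f(x) \\
\text{s.t.}\ & g_i(x) \le 0,\ \forall i = 1, 2, \ldots, p, \\
& h_i(x) = 0,\ \forall i = 1, 2, \ldots, q, \\
& H_i(x) \ge 0,\ \forall i = 1, 2, \ldots, r, \\
& G_i(x) H_i(x) \le 0,\ \forall i = 1, 2, \ldots, r,
\end{aligned} \qquad (1)$$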


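where all the functions $f, g_i, h_i, H_i, G_i : X \to \mathbb{R}$ are real-valued functions defined
on a Banach space X. Let the set of all feasible points $\Omega$ be given by

$$\begin{aligned}
\Omega := \{x \in X :\ & g_i(x) \le 0,\ \forall i = 1, 2, \ldots, p, \\
& h_i(x) = 0,\ \forall i = 1, 2, \ldots, q, \\
& H_i(x) \ge 0,\ \forall i = 1, 2, \ldots, r, \\
& G_i(x) H_i(x) \le 0,\ \forall i = 1, 2, \ldots, r\}.
\end{aligned}$$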


When $X := \mathbb{R}^n$ and the functions $f, g_i, h_i, G_i, H_i$ are continuously differentiable,
the problem (1) is called a mathematical program with vanishing constraints
(MPVC), which was introduced in [2] and constitutes a new class of difficult
optimization problems with important applications in topology design of mechanical
structures.

Vanishing constraints usually violate standard constraint qualifications, such as the
linear independence constraint qualification and the Mangasarian-Fromovitz
constraint qualification, whereas the Abadie and the Guignard constraint qualifications
need to be modified under suitable assumptions. In order to obtain reasonable
necessary and sufficient optimality conditions under very weak assumptions, several
MPVC-tailored constraint qualifications were developed in [2, 14-16] and further
studied for exact penalty results [17], sensitivity and relaxation methods [21],
and numerical algorithms [19, 20].

The MPVC is closely related to the more commonly known mathematical 
program with equilibrium constraints (MPEC) as follows: 


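$$\begin{aligned}
\min\ & f(x) \\
\text{s.t.}\ & g_i(x) \le 0,\ \forall i = 1, 2, \ldots, p, \\
& h_i(x) = 0,\ \forall i = 1, 2, \ldots, q, \\
& G_i(x) \ge 0,\ H_i(x) \ge 0,\ \forall i = 1, 2, \ldots, r, \\
& G_i(x) H_i(x) = 0,\ \forall i = 1, 2, \ldots, r.
\end{aligned}$$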

We refer to the monographs [30, 43] and the results from the MPEC literature [5, 6,
11, 38-40, 42, 43] for more details. It was shown in [2] that an MPVC can always
be reformulated as an MPEC, but this increases the dimension of the problem and
the reformulation is not unique. Moreover, it does not take into account the special
structure of the MPVC, and hence it is worth studying the MPVC directly.

In the MPVC literature, most of the results obtained under the assumption that the
objective and the constraint functions involved are continuously differentiable can
be found in [2, 13, 14, 16-22, 27, 34, 35], whereas some significant recent work
deals with nondifferentiability in the special structure of the
MPVC [24-26, 36]. It is well known that differentiability and Lipschitz continuity
are two different kinds of assumptions and may not imply each other in general.
Hence, for an MPVC with mixed assumptions of differentiability and Lipschitz
continuity, the nonsmooth KKT type necessary optimality conditions can be derived
using the tools of nonsmooth analysis (see, e.g., [23, 28, 29, 37, 47-49]) by replacing
the usual gradient by certain generalized gradients under Lipschitz assumptions.

The purpose of this paper is to provide a nonsmooth KKT type necessary 
optimality condition for the MPVC (1) with mixed assumptions of Gateaux 
differentiability, Fréchet differentiability, Hadamard differentiability, and Lipschitz 
continuity. Since the Michel-Penot (M-P) subdifferential is the smallest one among 
various convex valued generalized gradients which coincide with the usual deriva- 
tive when a function is Gateaux differentiable, we aim to provide a Lagrange 
multiplier rule in terms of the M-P subdifferential. 

The overview of this chapter is as follows. In the next section, we provide the
preliminary definitions and results which will be used in the sequel. In Sect. 3,
we give a suitable representation of the linearized cone for the NMPVC (1) and use it
to define the Abadie constraint qualification for the NMPVC (1), which is further used
to derive a nonsmooth KKT necessary optimality condition for the NMPVC (1).
In Sect. 4, we give various modifications of the Cottle constraint qualification, the
Slater constraint qualification, the Mangasarian-Fromovitz constraint qualification,
and the linear independence constraint qualification taking into account the special
structure of the NMPVC (1), establish relationships among them, and summarize
the results obtained. In Sect. 5, we conclude the chapter.


2 Preliminaries 


In this section, we recall some known definitions and results which will be used in 
the sequel. 


Definition 1 Let $X$ and $Y$ be two Banach spaces, let $x^* \in X$, and let $f : X \to Y$.
The usual directional derivative of $f$ at $x^*$ in a direction $v \in X$ is given by

\[
f'(x^*; v) := \lim_{t \downarrow 0} \frac{f(x^* + tv) - f(x^*)}{t}
\]

when the limit exists. If there exists $Df(x^*)$, an element of the space $L(X, Y)$ of
continuous linear operators from $X$ to $Y$, such that for every $v \in X$,
$f'(x^*; v) = \langle Df(x^*), v \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the
canonical pairing, then the function $f$ is said to be Gâteaux differentiable at $x^*$.

The function $f$ is said to be Hadamard differentiable at $x^*$, iff $Df(x^*) \in L(X, Y)$
and, for every $v \in X$, one has

\[
\lim_{t \downarrow 0,\, u \to v} \frac{f(x^* + tu) - f(x^*)}{t} = \langle Df(x^*), v \rangle.
\]

The function $f$ is said to be Fréchet differentiable at $x^*$, iff $Df(x^*) \in L(X, Y)$
and the convergence in

\[
f'(x^*; v) = \lim_{t \downarrow 0} \frac{f(x^* + tv) - f(x^*)}{t} = \langle Df(x^*), v \rangle
\]

is uniform with respect to $v$ in bounded sets.


Fréchet differentiability is stronger than Hadamard differentiability,
which in turn is stronger than Gâteaux differentiability. We refer to the books
[10, 46] for relationships among various differentiability properties and more details
related to nonsmooth analysis.
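As a quick numerical illustration of Definition 1 (a sketch only, not part of the original development), the snippet below approximates the one-sided directional derivative by a forward difference quotient. The test function $f(x) = x_1 + |x_2|$ on $\mathbb{R}^2$, the helper name `dir_derivative`, and the step size `t` are all assumptions chosen for illustration. At the origin every directional derivative exists, but the values along $v = (0, 1)$ and $-v$ do not sum to zero, so $v \mapsto f'(0; v)$ is not linear and $f$ is not Gâteaux differentiable there.

```python
def f(x):
    # Hypothetical test function on R^2: smooth in x[0], nonsmooth in x[1].
    return x[0] + abs(x[1])


def dir_derivative(func, x, v, t=1e-8):
    # Forward difference quotient approximating f'(x; v) for a small t > 0.
    shifted = [xi + t * vi for xi, vi in zip(x, v)]
    return (func(shifted) - func(x)) / t


origin = [0.0, 0.0]
d_plus = dir_derivative(f, origin, [0.0, 1.0])    # direction v
d_minus = dir_derivative(f, origin, [0.0, -1.0])  # direction -v
d_smooth = dir_derivative(f, origin, [1.0, 0.0]) + dir_derivative(f, origin, [-1.0, 0.0])

# A linear map would give d_plus + d_minus == 0; here the sum is close to 2.
print(d_plus, d_minus, d_smooth)
```

Along the smooth coordinate the two one-sided values cancel, which is what linearity of $v \mapsto f'(0; v)$ would require in every direction.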

The following concepts of generalized directional derivatives will play an 
important role in the subsequent analysis. 


Definition 2 The Clarke generalized directional derivative of a function $f : X \to \mathbb{R}$
at any $x^* \in X$ in a direction $v \in X$ is defined as follows:

\[
f^{\circ}(x^*; v) := \limsup_{x \to x^*,\, t \downarrow 0} \frac{f(x + tv) - f(x)}{t}.
\]

The Clarke generalized gradient of the function $f$ at $x^*$ is given by

\[
\partial^{\circ} f(x^*) := \{\xi \in X^* : \langle \xi, v \rangle \le f^{\circ}(x^*; v),\ \forall v \in X\}.
\]


The following concept of the Michel-Penot (M-P) subdifferential was introduced
in [32].

Definition 3 The Michel-Penot directional derivative of $f$ at $x^*$ in a direction
$v \in X$ is given by

\[
f^{\diamond}(x^*; v) := \sup_{w \in X} \limsup_{t \downarrow 0} \frac{f(x^* + tv + tw) - f(x^* + tw)}{t}.
\]

The Michel-Penot subdifferential of $f$ at $x^*$ is given by

\[
\partial^{\diamond} f(x^*) := \{\xi \in X^* : \langle \xi, v \rangle \le f^{\diamond}(x^*; v),\ \forall v \in X\}.
\]

It is clear that $f'(x^*; v) \le f^{\diamond}(x^*; v) \le f^{\circ}(x^*; v)$, $\forall v \in X$, and
$\partial^{D} f(x^*) \subseteq \partial^{\diamond} f(x^*) \subseteq \partial^{\circ} f(x^*)$, where
$\partial^{D} f(x^*)$ is the Dini subdifferential of $f$ at $x^*$ defined as

\[
\partial^{D} f(x^*) := \{\xi \in X^* : \langle \xi, v \rangle \le f'(x^*; v),\ \forall v \in X\}.
\]
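A standard one-dimensional example, added here only as an illustration, shows that the M-P subdifferential can be strictly smaller than the Clarke generalized gradient. The function below is an assumption chosen for this purpose and does not appear in the chapter; it is Gâteaux differentiable at the origin but not continuously differentiable there.

```latex
% Illustrative example (not from the chapter): X = \mathbb{R}, x^* = 0.
\[
f(x) := \begin{cases} x^2 \sin(1/x), & x \neq 0, \\ 0, & x = 0. \end{cases}
\]
% For every u \in \mathbb{R}, f(tu)/t = t u^2 \sin(1/(tu)) \to 0 as t \downarrow 0, hence
\[
f'(0; v) = 0, \qquad
f^{\diamond}(0; v) = \sup_{w \in \mathbb{R}} \limsup_{t \downarrow 0}
  \frac{f(tv + tw) - f(tw)}{t} = 0 .
\]
% Near the origin f'(x) = 2x \sin(1/x) - \cos(1/x) oscillates between -1 and 1, hence
\[
f^{\circ}(0; v) = |v| .
\]
% Consequently
\[
\partial^{D} f(0) = \partial^{\diamond} f(0) = \{0\} \subsetneq [-1, 1] = \partial^{\circ} f(0).
\]
```

Here the M-P subdifferential reduces to the derivative $f'(0) = 0$, while the Clarke generalized gradient records the oscillation of $f'$ near the origin.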


The concept of semiregularity was introduced in [8], analogous to Clarke
regularity [10], for Lipschitz continuous functions; it was further extended to
any Gâteaux differentiable function in [49], where it is known as M-P regularity.


Definition 4 A function $f : X \to \mathbb{R}$ is said to be M-P regular at $x^* \in X$, iff
the usual directional derivative $f'(x^*; v)$ exists and $f'(x^*; v) = f^{\diamond}(x^*; v)$ for all
$v \in X$.


Here, we give some basic properties of the M-P directional derivatives and M-P 
subdifferentials. We refer to [8, 32, 33, 48, 49] for motivations and references. 


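Proposition 1 Let $X$ be a Banach space, let $x^* \in X$, and let $f, g : X \to \mathbb{R}$
be either Gâteaux differentiable at $x^*$ or Lipschitz near $x^*$. Then, the following
properties hold:


(i) The function $v \mapsto f^{\diamond}(x^*; v)$ is finite, positively homogeneous, and subadditive
on $X$;

(ii) For any scalar $\lambda$, $\partial^{\diamond}(\lambda f)(x^*) = \lambda \partial^{\diamond} f(x^*)$, and for any $v \in X$,
$f^{\diamond}(x^*; -v) = (-f)^{\diamond}(x^*; v)$;

(iii) $\partial^{\diamond}(fg)(x^*) \subseteq f(x^*) \partial^{\diamond} g(x^*) + g(x^*) \partial^{\diamond} f(x^*)$ and
$(fg)^{\diamond}(x^*; v) \le f(x^*) g^{\diamond}(x^*; v) + g(x^*) f^{\diamond}(x^*; v)$. The equalities hold if both $f$ and $g$ are
M-P regular at $x^*$, $f(x^*) \ge 0$, $g(x^*) \ge 0$, and $fg$ is M-P regular at $x^*$;

(iv) $\partial^{\diamond} f(x^*)$ is a nonempty, convex, weak*-compact subset of $X^*$, and for every
$v \in X$, one has $f^{\diamond}(x^*; v) = \max \{ \langle \xi, v \rangle : \xi \in \partial^{\diamond} f(x^*) \}$;

(v) If $x^*$ is a local minimum of $f$, then $0 \in \partial^{\diamond} f(x^*)$ and $f^{\diamond}(x^*; v) \ge 0$ for all
$v \in X$.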


The notions of pseudoconvexity and pseudoconcavity were introduced in [49] to
allow nondifferentiability in terms of the M-P subdifferentials.


Definition 5 Let $f$ be a real-valued function defined on a Banach space $X$. The
function $f$ is said to be M-P pseudoconvex at $x^* \in X$, iff for all $x \in X$, one has

\[
f^{\diamond}(x^*; x - x^*) \ge 0 \ \Rightarrow\ f(x) \ge f(x^*).
\]

The function $f$ is said to be M-P pseudoconcave at $x^* \in X$, iff for all $x \in X$, one
has

\[
f^{\diamond}(x^*; x - x^*) \le 0 \ \Rightarrow\ f(x) \le f(x^*).
\]

The following concepts of quasiconvex and quasiconcave functions will also be
used in the sequel.

Definition 6 Let $f$ be a real-valued function defined on a Banach space $X$. The
function $f$ is said to be quasiconvex at $x^* \in X$, iff for all $x \in X$, one has

\[
f(x) \le f(x^*),\ 0 \le \lambda \le 1 \ \Rightarrow\ f(x^* + \lambda(x - x^*)) \le f(x^*).
\]

The function $f$ is said to be quasiconcave at $x^*$, iff $-f$ is quasiconvex at $x^*$. The
function $f$ is said to be quasiaffine, iff it is both quasiconvex and quasiconcave.


We refer to [9, 31] for more details and applications of generalized convex
functions. Now, we recall the notions of the contingent cone, the Clarke tangent
cone, the Clarke normal cone and the polar cone [3, 4, 7, 41, 45].


Definition 7 Let $\Omega \subseteq X$ and $x^* \in \operatorname{cl} \Omega$. The contingent cone of $\Omega$ at $x^*$ is the
nonempty closed cone defined by

\[
K(\Omega; x^*) := \{v \in X : \exists \{v_n\} \subset X,\ v_n \to v,\ \exists t_n \downarrow 0 :\ x^* + t_n v_n \in \Omega,\ \forall n\}.
\]

The Clarke tangent cone of $\Omega$ at $x^*$ is the nonempty closed cone defined by

\[
T(\Omega; x^*) := \{v \in X : \forall \{x_n\} \subset \Omega,\ x_n \to x^*,\ \forall t_n \downarrow 0,\ \exists v_n \to v :\ x_n + t_n v_n \in \Omega,\ \forall n\}.
\]

The Clarke normal cone to $\Omega$ at $x^*$ is defined by

\[
N(\Omega; x^*) := \{\xi \in X^* : \langle \xi, v \rangle \le 0,\ \forall v \in T(\Omega; x^*)\}.
\]

The cones $T(\Omega; x^*)$ and $N(\Omega; x^*)$ are convex, while $K(\Omega; x^*)$ is not necessarily
convex, and $T(\Omega; x^*) \subseteq K(\Omega; x^*)$.

For any set $A \subseteq X$, the polar cone of $A$ is given by

\[
A^{\circ} := \{\xi \in X^* : \langle \xi, v \rangle \le 0,\ \forall v \in A\}.
\]
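To see how the three cones of Definition 7 can differ, consider the following small example in $\mathbb{R}^2$. It is an illustration added here under the stated choice of set, not an example from the chapter.

```latex
% Illustrative example (not from the chapter): the union of the two coordinate axes.
\[
\Omega := \{x \in \mathbb{R}^2 : x_1 x_2 = 0\}, \qquad x^* := (0, 0).
\]
% \Omega is a closed cone, so every tangent direction at the origin lies in \Omega itself:
\[
K(\Omega; x^*) = \Omega \quad \text{(closed, but not convex)}.
\]
% For v = (1, 0), take x_n = (0, 1/n) \in \Omega and t_n = 1/n. Then x_n + t_n v_n \in \Omega
% with v_n \to v forces the second component of v_n to equal -1, which is impossible.
% The same argument applies to every nonzero v \in \Omega, hence
\[
T(\Omega; x^*) = \{0\} \subsetneq K(\Omega; x^*), \qquad
N(\Omega; x^*) = \{0\}^{\circ} = \mathbb{R}^2 .
\]
```

Sets of this kind, unions of pieces meeting at a point, are typical of the feasible sets of vanishing constraints, which is one reason standard constraint qualifications can fail there.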


Now, consider the optimization problem as follows:

\[
\begin{aligned}
\min\ & f(x) \\
\text{s.t. } & g_i(x) \le 0,\ \forall i = 1, 2, \ldots, p, \\
& h_i(x) = 0,\ \forall i = 1, 2, \ldots, q,
\end{aligned}
\tag{2}
\]

where all the functions $f, g_i, h_i : X \to \mathbb{R}$ are locally Lipschitz on a Banach space
$X$. Let the set of all feasible points of the optimization problem (2) be given by

\[
\Omega := \{x \in X : g_i(x) \le 0,\ \forall i = 1, 2, \ldots, p,\ \ h_i(x) = 0,\ \forall i = 1, 2, \ldots, q\}.
\]

Also, consider the following index sets at a point $x^* \in \Omega$:

\[
I_g := \{i \in \{1, \ldots, p\} : g_i(x^*) = 0\}, \qquad I_h := \{1, \ldots, q\}. \tag{3}
\]
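Here, we give the nondifferentiable analogs of some constraint qualifications in
terms of the M-P subdifferentials that may be found in the literature [1, 31, 44, 47-49].


Definition 8 Let $x^* \in \Omega$. The nonsmooth Abadie constraint qualification (NACQ)
is satisfied at $x^*$, iff $g_i$ ($i \in I_g$) and $h_i$ ($i \in I_h$) are either Gâteaux differentiable at
$x^*$ or Lipschitz near $x^*$, the convex cone generated by

\[
A := \bigcup_{i \in I_g} \partial^{\diamond} g_i(x^*) \ \cup\ \bigcup_{i \in I_h} \partial^{\diamond} h_i(x^*) \ \cup\ \bigcup_{i \in I_h} \partial^{\diamond} (-h_i)(x^*)
\]

is closed and

\[
L(\Omega; x^*) \subseteq K(\Omega; x^*),
\]

where

\[
L(\Omega; x^*) := \{v \in X : g_i^{\diamond}(x^*; v) \le 0,\ \forall i \in I_g,\ \ h_i^{\diamond}(x^*; v) = 0,\ \forall i \in I_h\}
\]

denotes the corresponding linearized cone of $\Omega$ at $x^*$.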


Definition 9 Let $x^* \in \Omega$. The nonsmooth Mangasarian-Fromovitz constraint
qualification (NMFCQ) is satisfied at $x^*$, iff $g_i$ ($i \in I_g$) are either Hadamard
differentiable at $x^*$ or Lipschitz near $x^*$, $g_i$ ($i \notin I_g$) are continuous at $x^*$,
$h_i$ ($i \in I_h$) are Fréchet differentiable at $x^*$ and continuous in a neighborhood of
$x^*$, $Dh_1(x^*), \ldots, Dh_q(x^*)$ are linearly independent, and there exists $v \in X$ such
that

\[
\begin{aligned}
g_i^{\diamond}(x^*; v) &< 0,\ \forall i \in I_g, \\
\langle Dh_i(x^*), v \rangle &= 0,\ \forall i \in I_h.
\end{aligned}
\]

Definition 10 Let $x^* \in \Omega$. The nonsmooth linear independence constraint
qualification (NLICQ) is satisfied at $x^*$, iff $g_i$ ($i \in I_g$) are either Hadamard differentiable
at $x^*$ or Lipschitz near $x^*$, $g_i$ ($i \notin I_g$) are continuous at $x^*$, $h_i$ ($i \in I_h$) are
Fréchet differentiable at $x^*$ and continuous in a neighborhood of $x^*$, and for any
$\xi_i^{g} \in \partial^{\diamond} g_i(x^*)$, the vectors $\{\xi_i^{g}\ (i \in I_g),\ Dh_1(x^*), \ldots, Dh_q(x^*)\}$ are linearly independent.


Definition 11 Let $x^* \in \Omega$. The nonsmooth Slater constraint qualification (NSCQ)
is satisfied at $x^*$, iff $g_i$ ($i \in I_g$) are M-P pseudoconvex at $x^*$ and either Hadamard
differentiable at $x^*$ or Lipschitz near $x^*$; $g_i$ ($i \notin I_g$) are continuous at $x^*$; $h_i$ ($i \in I_h$)
are Fréchet differentiable at $x^*$, continuous in a neighborhood of $x^*$, and
quasiaffine at $x^*$; $\{Dh_1(x^*), \ldots, Dh_q(x^*)\}$ are linearly independent, and there
exists $\hat{x} \in X$ such that

\[
\begin{aligned}
g_i(\hat{x}) &< 0,\ \forall i \in I_g, \\
h_i(\hat{x}) &= 0,\ \forall i \in I_h.
\end{aligned}
\]
The following nonsmooth KKT necessary optimality condition was developed in
[49]; it gives a generalized Lagrange multiplier rule for the problem (2) in terms
of the M-P subdifferentials.


Theorem 1 Let $x^* \in \Omega$ be a local minimum of the problem (2). Let any one of the
constraint qualifications of Definitions 8-11 hold at $x^*$. If $f$ is either Fréchet
differentiable at $x^*$ or Lipschitz near $x^*$, then the KKT conditions in terms of the
M-P subdifferentials hold at $x^*$, that is, there exist Lagrange multipliers
$\alpha_i \in \mathbb{R}_+$ ($i \in I_g$), $\beta_i \in \mathbb{R}$ ($i \in I_h$) such that

\[
0 \in \partial^{\diamond} f(x^*) + \sum_{i \in I_g} \alpha_i \partial^{\diamond} g_i(x^*) + \sum_{i \in I_h} \beta_i \partial^{\diamond} h_i(x^*).
\]

For the continuously differentiable case, the above nonsmooth KKT type necessary
optimality condition reduces to the classical KKT necessary optimality
condition (see, e.g., [3, 12, 31, 41]).
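To make the reduction to the classical KKT system concrete, here is a minimal sketch on a hypothetical smooth instance of problem (2) in $\mathbb{R}^2$: minimize $f(x) = x_1^2 + x_2^2$ subject to $g_1(x) = 1 - x_1 - x_2 \le 0$. The instance, the candidate point `x_star`, and the multiplier `alpha` are assumptions chosen for illustration. For continuously differentiable data each M-P subdifferential is the singleton containing the gradient, so the inclusion of Theorem 1 becomes the equation $\nabla f(x^*) + \alpha_1 \nabla g_1(x^*) = 0$ with $\alpha_1 \ge 0$.

```python
def grad_f(x):
    # Gradient of the hypothetical objective f(x) = x1^2 + x2^2.
    return [2.0 * x[0], 2.0 * x[1]]


def g1(x):
    # Single inequality constraint g1(x) = 1 - x1 - x2 <= 0.
    return 1.0 - x[0] - x[1]


def grad_g1(x):
    # g1 is affine, so its gradient is constant.
    return [-1.0, -1.0]


x_star = [0.5, 0.5]  # candidate minimizer (assumed)
alpha = 1.0          # candidate multiplier (assumed); must be nonnegative

# Stationarity residual: grad f(x*) + alpha * grad g1(x*).
residual = [gf + alpha * gg for gf, gg in zip(grad_f(x_star), grad_g1(x_star))]

# The constraint is active at x*, so the index 1 belongs to I_g.
is_active = abs(g1(x_star)) < 1e-12

print(residual, is_active)
```

A zero residual together with `alpha >= 0` and an active constraint is exactly the classical KKT system at `x_star` for this instance.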


3 Necessary Optimality Conditions 


To derive the nonsmooth KKT necessary optimality conditions for the NMPVC (1),
we introduce the following index sets at a local minimum $x^*$ of the NMPVC (1),
which will be used frequently in the subsequent analysis:

\[
\begin{aligned}
I_g &:= \{i \in \{1, 2, \ldots, p\} : g_i(x^*) = 0\}, \\
I_h &:= \{1, 2, \ldots, q\}, \\
I_+ &:= \{i \in \{1, 2, \ldots, r\} : H_i(x^*) > 0\}, \\
I_0 &:= \{i \in \{1, 2, \ldots, r\} : H_i(x^*) = 0\}.
\end{aligned}
\tag{4}
\]

The index set $I_+$ can be further divided as follows:

\[
\begin{aligned}
I_{+0} &:= \{i \in \{1, 2, \ldots, r\} : H_i(x^*) > 0,\ G_i(x^*) = 0\}, \\
I_{+-} &:= \{i \in \{1, 2, \ldots, r\} : H_i(x^*) > 0,\ G_i(x^*) < 0\}.
\end{aligned}
\tag{5}
\]

Similarly, the index set $I_0$ can be partitioned as follows:

\[
\begin{aligned}
I_{0+} &:= \{i \in \{1, 2, \ldots, r\} : H_i(x^*) = 0,\ G_i(x^*) > 0\}, \\
I_{00} &:= \{i \in \{1, 2, \ldots, r\} : H_i(x^*) = 0,\ G_i(x^*) = 0\}, \\
I_{0-} &:= \{i \in \{1, 2, \ldots, r\} : H_i(x^*) = 0,\ G_i(x^*) < 0\}.
\end{aligned}
\tag{6}
\]
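The partition in (4)-(6) depends only on the signs of $H_i(x^*)$ and $G_i(x^*)$, so it is easy to compute. The following sketch assumes the values are supplied as two lists; the function name `classify_indices`, the dictionary keys, and the tolerance `tol` used to decide when a value counts as zero are illustrative choices, not part of the chapter.

```python
def classify_indices(H_vals, G_vals, tol=1e-12):
    # Partition the indices 1..r by the signs of H_i(x*) and G_i(x*), as in (4)-(6).
    sets = {"I+0": [], "I+-": [], "I0+": [], "I00": [], "I0-": []}
    for i, (h, g) in enumerate(zip(H_vals, G_vals), start=1):
        # A feasible point needs H_i >= 0 and G_i * H_i <= 0.
        if h < -tol or (h > tol and g > tol):
            raise ValueError(f"index {i} violates H_i >= 0 or G_i * H_i <= 0")
        h_sign = "+" if h > tol else "0"
        g_sign = "+" if g > tol else ("-" if g < -tol else "0")
        sets["I" + h_sign + g_sign].append(i)
    return sets


# Hypothetical values of H_i(x*) and G_i(x*) for r = 5 vanishing constraints.
parts = classify_indices([2.0, 1.0, 0.0, 0.0, 0.0], [0.0, -3.0, 4.0, 0.0, -1.0])
print(parts)
```

In this made-up instance each of the five index sets receives exactly one index, and $I_+ = I_{+0} \cup I_{+-}$, $I_0 = I_{0+} \cup I_{00} \cup I_{0-}$ can be read off from the keys.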


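Also, consider the following function

\[
\theta_i(x) := G_i(x) H_i(x),\ \forall i = 1, \ldots, r. \tag{7}
\]

The M-P subdifferential of $\theta_i$ at $x$ satisfies

\[
\partial^{\diamond} \theta_i(x) \subseteq G_i(x) \partial^{\diamond} H_i(x) + H_i(x) \partial^{\diamond} G_i(x),\ \forall i = 1, \ldots, r. \tag{8}
\]

Let the functions $H_i$ ($i \in I_{00} \cup I_{0+}$), $G_i$ ($i \in I_{00} \cup I_{+0}$) be M-P regular at $x^*$ and
$I_{0-} = \emptyset$. Then, the definition of the index sets (4)-(6) implies

\[
\partial^{\diamond} \theta_i(x^*) =
\begin{cases}
\{0\} & \text{if } i \in I_{00}, \\
G_i(x^*) \partial^{\diamond} H_i(x^*) & \text{if } i \in I_{0+}, \\
H_i(x^*) \partial^{\diamond} G_i(x^*) & \text{if } i \in I_{+0}.
\end{cases}
\tag{9}
\]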


The following result is a nonsmooth analogue of [2, Lemma 2], which shows that
under reasonably weak assumptions the NLICQ of Definition 10 is not satisfied for
the NMPVC (1).


Lemma 1 Let $x^*$ be a feasible solution of the NMPVC (1) such that $I_0 \neq \emptyset$,
$I_{0-} = \emptyset$, and $H_i$ ($i \in I_{00} \cup I_{0+}$), $G_i$ ($i \in I_{+0}$) are M-P regular at $x^*$. Then, NLICQ is
violated at the point $x^*$.


Proof Suppose that NLICQ is satisfied at $x^*$. Then by Definition 10, the functions
$g_i$ ($i \in I_g$), $H_i$ ($i \in I_0$), $\theta_i$ ($i \in I_0 \cup I_{+0}$) are either Hadamard differentiable at
$x^*$ or Lipschitz near $x^*$, the functions $g_i$ ($i \notin I_g$), $H_i$ ($i \notin I_0$), $\theta_i$ ($i \notin I_0 \cup I_{+0}$)
are continuous at $x^*$, the functions $h_i$ ($i \in I_h$) are Fréchet differentiable at $x^*$ and
continuous in a neighborhood of $x^*$, and the subgradients

\[
\begin{aligned}
& \xi_i^{g} \in \partial^{\diamond} g_i(x^*)\ (i \in I_g), \\
& Dh_i(x^*)\ (i \in I_h), \\
& \xi_i^{H} \in \partial^{\diamond} H_i(x^*)\ (i \in I_{00} \cup I_{0+}), \\
& \xi_i^{\theta} \in \partial^{\diamond} \theta_i(x^*)\ (i \in I_{00} \cup I_{0+} \cup I_{+0})
\end{aligned}
\tag{10}
\]

must be linearly independent. Since $I_0 \neq \emptyset$, we have $I_{00} \neq \emptyset$ or $I_{0+} \neq \emptyset$.
However, for $i \in I_{00}$, we get $0 \in \partial^{\diamond} \theta_i(x^*)$, which cannot be a member of a set
of linearly independent vectors. On the other hand, if $i \in I_{0+}$, it follows from (9)
that any $\xi_i^{\theta} \in \partial^{\diamond} \theta_i(x^*)$ is a nonzero multiple of some $\xi_i^{H} \in \partial^{\diamond} H_i(x^*)$ and hence
$\xi_i^{\theta}$ together with the corresponding $\xi_i^{H}$ forms a linearly dependent subset of vectors
from (10). These contradictions show that NLICQ is not satisfied at $x^*$. □


The following result is a nonsmooth analogue of [2, Lemma 3], which shows that
under a reasonable assumption the NMFCQ does not hold for the NMPVC (1).


Lemma 2 Let $x^*$ be a feasible solution of the NMPVC (1) such that $I_{00} \cup I_{0+} \neq \emptyset$,
$I_{0-} = \emptyset$, and $H_i$ ($i \in I_{00} \cup I_{0+}$), $G_i$ ($i \in I_{+0}$) are M-P regular at $x^*$. Then,
NMFCQ is not satisfied at $x^*$.

Proof Suppose that NMFCQ is satisfied at $x^*$. Then, by Definition 9, the
functions $g_i$ ($i \in I_g$), $H_i$ ($i \in I_0$), $\theta_i$ ($i \in I_0 \cup I_{+0}$) are either Hadamard differentiable
at $x^*$ or Lipschitz near $x^*$, the functions $g_i$ ($i \notin I_g$), $H_i$ ($i \notin I_0$), $\theta_i$ ($i \notin I_0 \cup I_{+0}$)
are continuous at $x^*$, the functions $h_i$ ($i \in I_h$) are Fréchet differentiable at $x^*$ and
continuous in a neighborhood of $x^*$, $\{Dh_i(x^*) : i \in I_h\}$ are linearly independent,
and there exists $v \in X$ such that

\[
\begin{aligned}
g_i^{\diamond}(x^*; v) &< 0,\ \forall i \in I_g, \\
\langle Dh_i(x^*), v \rangle &= 0,\ \forall i \in I_h, \\
H_i^{\diamond}(x^*; v) &> 0,\ \forall i \in I_0, \\
\theta_i^{\diamond}(x^*; v) &< 0,\ \forall i \in I_0 \cup I_{+0}.
\end{aligned}
\tag{11}
\]

Suppose that $i \in I_{00}$; then from (9), one has $\theta_i^{\diamond}(x^*; v) = 0$, a contradiction to (11).
Again, suppose that $i \in I_{0+}$; then from (9), one has

\[
H_i^{\diamond}(x^*; v) = \frac{1}{G_i(x^*)}\, \theta_i^{\diamond}(x^*; v) < 0,
\]

again a contradiction to (11). These contradictions lead to the fact that NMFCQ is
not satisfied at $x^*$. □


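Lemma 3 Let $x^* \in \Omega$ be a local minimum of the NMPVC (1). Let $g_i$ ($i \in I_g$),
$h_i$ ($i \in I_h$), $H_i$ ($i \in I_{00} \cup I_{0+}$), $G_i$ ($i \in I_{+0}$) be either Gâteaux differentiable
at $x^*$ or Lipschitz near $x^*$. Let $I_{0-}$ be empty and let $H_i$ ($i \in I_{00} \cup I_{0+}$), $G_i$ ($i \in I_{+0}$) be
M-P regular at $x^*$. Then, the linearized cone of the NMPVC (1) to $\Omega$ at $x^*$ is given
by

\[
\begin{aligned}
L(\Omega; x^*) = \{v \in X :\ & g_i^{\diamond}(x^*; v) \le 0,\ \forall i \in I_g, \\
& h_i^{\diamond}(x^*; v) = 0,\ \forall i \in I_h, \\
& H_i^{\diamond}(x^*; v) = 0,\ \forall i \in I_{0+}, \\
& H_i^{\diamond}(x^*; v) \ge 0,\ \forall i \in I_{00}, \\
& G_i^{\diamond}(x^*; v) \le 0,\ \forall i \in I_{+0}\}.
\end{aligned}
\]

Proof Let $\theta_i$ denote the function from (7). Then, using the definition of
the index sets from (4)-(6), it follows that the linearized cone of the NMPVC (1) to
$\Omega$ at $x^*$ is given by

\[
\begin{aligned}
L(\Omega; x^*) = \{v \in X :\ & g_i^{\diamond}(x^*; v) \le 0,\ \forall i \in I_g, \\
& h_i^{\diamond}(x^*; v) = 0,\ \forall i \in I_h, \\
& H_i^{\diamond}(x^*; v) \ge 0,\ \forall i \in I_{00} \cup I_{0+}, \\
& \theta_i^{\diamond}(x^*; v) \le 0,\ \forall i \in I_{00} \cup I_{0+} \cup I_{+0}\}.
\end{aligned}
\]

Now, using the expression of the M-P subdifferential $\partial^{\diamond} \theta_i(x^*)$ for $i \in I_{00} \cup I_{0+} \cup I_{+0}$
as given in (9), we get the required representation of the linearized cone of the
NMPVC (1) to $\Omega$ at $x^*$. □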


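In the next theorem, we derive a nonsmooth KKT necessary optimality condition
for the NMPVC (1) under certain assumptions when the NACQ is satisfied at a
local minimum $x^* \in \Omega$ of the NMPVC (1). For the continuously differentiable
case the following theorem reduces to [2, Theorem 1] and could be considered as a
generalization of [49, Theorem 3.1] for the NMPVC (1).


Theorem 2 Let $x^* \in \Omega$ be a local minimum of the NMPVC (1) such that NACQ
holds at $x^*$, and $f$ is either Fréchet differentiable at $x^*$ or Lipschitz near $x^*$. Also,
let $I_{0-} = \emptyset$ and let $H_i$ ($i \in I_{00} \cup I_{0+}$), $G_i$ ($i \in I_{+0}$) be M-P regular at $x^*$. Then, there
exist scalars $\alpha_i \in \mathbb{R}_+$ ($i \in I_g$), $\beta_i \in \mathbb{R}$ ($i \in I_h$), $\gamma_i \in \mathbb{R}_+$ ($i \in I_{00}$),
$\eta_i^{H} \in \mathbb{R}$ ($i \in I_{0+}$), $\eta_i^{G} \in \mathbb{R}_+$ ($i \in I_{+0}$) such that

\[
\begin{aligned}
0 \in\ & \partial^{\diamond} f(x^*) + \sum_{i \in I_g} \alpha_i \partial^{\diamond} g_i(x^*) + \sum_{i \in I_h} \beta_i \partial^{\diamond} h_i(x^*) \\
& - \sum_{i \in I_{00}} \gamma_i \partial^{\diamond} H_i(x^*) - \sum_{i \in I_{0+}} \eta_i^{H} \partial^{\diamond} H_i(x^*) + \sum_{i \in I_{+0}} \eta_i^{G} \partial^{\diamond} G_i(x^*).
\end{aligned}
\]

Proof Under the assumptions of the theorem, by Theorem 1, it is easy to show that
there exist scalars $\alpha_i \in \mathbb{R}_+$ ($i \in I_g$), $\beta_i \in \mathbb{R}$ ($i \in I_h$), $\gamma_i \in \mathbb{R}_+$ ($i \in I_{00} \cup I_{0+}$),
$\rho_i \in \mathbb{R}_+$ ($i \in I_{00} \cup I_{0+} \cup I_{+0}$) such that

\[
\begin{aligned}
0 \in\ & \partial^{\diamond} f(x^*) + \sum_{i \in I_g} \alpha_i \partial^{\diamond} g_i(x^*) + \sum_{i \in I_h} \beta_i \partial^{\diamond} h_i(x^*) \\
& - \sum_{i \in I_{00} \cup I_{0+}} \gamma_i \partial^{\diamond} H_i(x^*) + \sum_{i \in I_{00} \cup I_{0+} \cup I_{+0}} \rho_i \partial^{\diamond} \theta_i(x^*),
\end{aligned}
\]

where $\theta_i$ denotes the function from (7). Using the M-P regularity of $H_i$ ($i \in I_{00} \cup I_{0+}$),
$G_i$ ($i \in I_{+0}$) at $x^*$ and setting

\[
\gamma_i - \rho_i G_i(x^*) = \eta_i^{H},\ \forall i \in I_{0+}
\]

and

\[
\rho_i H_i(x^*) = \eta_i^{G},\ \forall i \in I_{+0},
\]

we get the required nonsmooth KKT necessary optimality condition for the NMPVC
(1) at a local minimum $x^*$. □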


Example 1 Consider the nonsmooth optimization problem

\[
\begin{aligned}
\min\ & f(x) = \sqrt{x_1^2 + x_2^2} \\
\text{s.t. } & H_1(x) := x_1 + |x_2| \ge 0, \\
& G_1(x) H_1(x) := x_1 (x_1 + |x_2|) \le 0,
\end{aligned}
\]

which is an NMPVC with $X := \mathbb{R}^2$, $p = q = 0$, and $r = 1$. The set of all feasible
solutions $\Omega$ is given by

\[
\begin{aligned}
\Omega = {} & \{x \in \mathbb{R}^2 : x_1 + x_2 \ge 0,\ x_1 (x_1 + x_2) \le 0,\ x_2 \ge 0\} \\
& \cup \{x \in \mathbb{R}^2 : x_1 - x_2 \ge 0,\ x_1 (x_1 - x_2) \le 0,\ x_2 < 0\}.
\end{aligned}
\]

The origin $x^* := (0, 0)$ is the unique solution of the problem and $I_{00} = \{1\} \neq \emptyset$.
The tangent cone to $\Omega$ at $x^*$ is given by

\[
\begin{aligned}
T(\Omega; x^*) = {} & \{v \in \mathbb{R}^2 : v_1 + v_2 \ge 0,\ v_1 (v_1 + v_2) \le 0,\ v_2 \ge 0\} \\
& \cup \{v \in \mathbb{R}^2 : v_1 - v_2 \ge 0,\ v_1 (v_1 - v_2) \le 0,\ v_2 < 0\}.
\end{aligned}
\]

Using Lemma 3, the representation of the corresponding linearized cone to $\Omega$ at $x^*$
is given by

\[
L(\Omega; x^*) = \{v \in \mathbb{R}^2 : v_1 + v_2 \ge 0,\ v_2 \ge 0\} \cup \{v \in \mathbb{R}^2 : v_1 - v_2 \ge 0,\ v_2 < 0\}.
\]

Here, the linearized cone is strictly larger than the tangent cone and hence NACQ is
violated at the optimal solution.
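The gap between the two cones in Example 1 can be checked on a single direction. The sketch below uses the functions $H_1(x) = x_1 + |x_2|$ and $G_1(x) = x_1$ of the example; the helper names and the sampled step sizes are assumptions made for illustration. Both constraints are positively homogeneous, so $\Omega$ is a closed cone and a direction $v$ is tangent to $\Omega$ at the origin exactly when $v \in \Omega$. Since $H_1$ is sublinear, its M-P directional derivative at the origin is $H_1^{\diamond}(0; v) = v_1 + |v_2|$.

```python
def H1(x):
    return x[0] + abs(x[1])


def G1(x):
    return x[0]


def is_feasible(x):
    # Feasibility for Example 1: H1(x) >= 0 and G1(x) * H1(x) <= 0.
    return H1(x) >= 0 and G1(x) * H1(x) <= 0


def in_linearized_cone(v):
    # Lemma 3 with I_00 = {1}: the only condition is H1's M-P derivative >= 0,
    # which equals v1 + |v2| because H1 is sublinear.
    return v[0] + abs(v[1]) >= 0


v = [1.0, 0.0]
steps = [10.0 ** (-k) for k in range(1, 8)]
# Points x* + t v along the ray through v, for decreasing t.
ray_feasible = [is_feasible([t * v[0], t * v[1]]) for t in steps]

print(in_linearized_cone(v), any(ray_feasible), is_feasible([-1.0, 1.0]))
```

The direction $v = (1, 0)$ satisfies the linearized conditions, yet no point $x^* + t v$ with $t > 0$ is feasible, which is the failure of the inclusion $L(\Omega; x^*) \subseteq K(\Omega; x^*)$ required by the NACQ. The direction $(-1, 1)$, by contrast, lies in $\Omega$ itself.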


4 Sufficient Conditions for the Nonsmooth Abadie 
Constraint Qualification 


In this section, we consider some constraint qualifications for the NMPVC (1) which
will serve as sufficient conditions for the NACQ to hold.


Definition 12 Let $x^* \in \Omega$ be a feasible solution of the NMPVC (1). Then,
a constraint qualification of Cottle type for the NMPVC (1), denoted by
CCQ-NMPVC, is satisfied at $x^*$, iff $g_i$ ($i \in I_g$), $H_i$ ($i \in I_{00}$), $G_i$ ($i \in I_{+0}$) are either
Hadamard differentiable at $x^*$ or Lipschitz near $x^*$, $h_i$ ($i \in I_h$), $H_i$ ($i \in I_{0+}$) are
Fréchet differentiable at $x^*$ and continuous in a neighborhood of $x^*$, and there exists
$v \in X$ such that

\[
\begin{aligned}
g_i^{\diamond}(x^*; v) &< 0,\ \forall i \in I_g, \\
G_i^{\diamond}(x^*; v) &< 0,\ \forall i \in I_{+0}, \\
H_i^{\diamond}(x^*; v) &> 0,\ \forall i \in I_{00}, \\
\langle Dh_i(x^*), v \rangle &= 0,\ \forall i \in I_h, \\
\langle DH_i(x^*), v \rangle &= 0,\ \forall i \in I_{0+}.
\end{aligned}
\]

The following proposition shows that under suitable hypotheses CCQ-NMPVC
implies the NACQ for the NMPVC (1).


Proposition 2 Let $x^*$ be a local minimum of the NMPVC (1) with $I_{00} \cup I_{0-} = \emptyset$
and such that CCQ-NMPVC holds. Then, the NACQ holds at $x^*$.


Proof Let $v \in L(\Omega; x^*)$. We first show that $v \in K(\Omega; x^*)$. Since CCQ-NMPVC
is satisfied at $x^*$, there exists $\hat{v} \in X$ such that

\[
\begin{aligned}
g_i^{\diamond}(x^*; \hat{v}) &< 0,\ \forall i \in I_g, \\
G_i^{\diamond}(x^*; \hat{v}) &< 0,\ \forall i \in I_{+0}, \\
\langle Dh_i(x^*), \hat{v} \rangle &= 0,\ \forall i \in I_h, \\
\langle DH_i(x^*), \hat{v} \rangle &= 0,\ \forall i \in I_{0+}.
\end{aligned}
\tag{12}
\]

For any $t_n \downarrow 0$, define $v_n \to v$ by

\[
v_n := v + t_n \hat{v}.
\]

By positive homogeneity and subadditivity of $g_i^{\diamond}(x^*; \cdot)$ ($i \in I_g$), one has

\[
g_i^{\diamond}(x^*; v_n) \le g_i^{\diamond}(x^*; v) + t_n g_i^{\diamond}(x^*; \hat{v}) < 0,\ \forall i \in I_g. \tag{13}
\]

Similarly,

\[
G_i^{\diamond}(x^*; v_n) \le G_i^{\diamond}(x^*; v) + t_n G_i^{\diamond}(x^*; \hat{v}) < 0,\ \forall i \in I_{+0}. \tag{14}
\]

By Fréchet differentiability of $h_i$ ($i \in I_h$) and $H_i$ ($i \in I_{0+}$) at $x^*$, one has

\[
\begin{aligned}
\langle Dh_i(x^*), v_n \rangle &= \langle Dh_i(x^*), v \rangle + t_n \langle Dh_i(x^*), \hat{v} \rangle = 0,\ \forall i \in I_h, \\
\langle DH_i(x^*), v_n \rangle &= \langle DH_i(x^*), v \rangle + t_n \langle DH_i(x^*), \hat{v} \rangle = 0,\ \forall i \in I_{0+}.
\end{aligned}
\tag{15}
\]

Now, for any $t_k \downarrow 0$, define $x_k \to x^*$ by

\[
x_k := x^* + t_k v_n.
\]

By (13), one has

\[
0 > g_i^{\diamond}(x^*; v_n) \ge \limsup_{t \downarrow 0} \frac{g_i(x^* + t v_n) - g_i(x^*)}{t}
\ge \limsup_{k \to \infty} \frac{g_i(x^* + t_k v_n) - g_i(x^*)}{t_k},
\]

which implies that

\[
g_i(x_k) < 0,\ \forall i \in I_g.
\]

Also, by continuity of $g_i$ ($i \notin I_g$), one has

\[
g_i(x_k) < 0,\ \forall i \notin I_g \text{ and } \forall k \text{ sufficiently large.}
\]

Without loss of generality, we may assume that

\[
g_i(x_k) < 0,\ \forall k,\ \forall i \in \{1, \ldots, p\}. \tag{16}
\]

Similarly, by (14), one has

\[
G_i(x_k) < 0,\ \forall k,\ \forall i \in \{1, \ldots, r\}. \tag{17}
\]

From (15), it follows that

\[
0 = H_i'(x^*; v_n) = \liminf_{t \downarrow 0} \frac{H_i(x^* + t v_n) - H_i(x^*)}{t}
\le \liminf_{k \to \infty} \frac{H_i(x^* + t_k v_n) - H_i(x^*)}{t_k},
\]

which implies that

\[
H_i(x_k) \ge 0,\ \forall i \in I_{0+}.
\]

By continuity of $H_i$ ($i \notin I_0$), one has

\[
H_i(x_k) > 0,\ \forall i \notin I_0,\ \forall k \text{ sufficiently large.}
\]

Without loss of generality, this implies that

\[
H_i(x_k) \ge 0,\ \forall k,\ \forall i \in \{1, \ldots, r\}. \tag{18}
\]

By Fréchet differentiability of $h_i$ ($i \in I_h$), $H_i$ ($i \in I_{0+}$), one has

\[
h_i(x_k) = 0,\ \forall i \in \{1, \ldots, q\},\ \forall k. \tag{19}
\]

From (16)-(19), one has

\[
x_k \in \Omega,\ \forall k, \quad \text{and} \quad \frac{x_k - x^*}{t_k} \to v_n,
\]

which implies that

\[
v_n \in K(\Omega; x^*),
\]

and since $K(\Omega; x^*)$ is closed,

\[
v \in K(\Omega; x^*).
\]

Hence, NACQ holds at $x^*$. □

It is clear from Example 1 that CCQ-NMPVC holds at $x^* := (0, 0)$, but the
NACQ is not satisfied. Hence the implication of Proposition 2 does not hold if $I_{00} \neq \emptyset$.
Moreover, the representation of the linearized cone will be affected if $I_{0-} \neq \emptyset$,
and hence as a consequence of Theorem 2 and Proposition 2, it follows that the
nonsmooth KKT conditions are the necessary optimality conditions for the NMPVC
(1) at a local minimum $x^*$ under the assumption that $I_{00} \cup I_{0-} = \emptyset$ and the
CCQ-NMPVC holds.

Now, we give a constraint qualification of Slater type for the NMPVC (1),
denoted by SCQ-NMPVC, which will serve as a sufficient condition for the
CCQ-NMPVC to hold.


Definition 13 Let $x^* \in \Omega$ be a feasible solution for the NMPVC (1). The
SCQ-NMPVC is satisfied at $x^*$, iff $g_i$ ($i \in I_g$), $G_i$ ($i \in I_{+0}$) are M-P pseudoconvex at $x^*$
and either Hadamard differentiable at $x^*$ or Lipschitz near $x^*$; $g_i$ ($i \notin I_g$), $G_i$ ($i \notin I_{+0}$)
are continuous at $x^*$; $H_i$ ($i \in I_{00}$) are M-P pseudoconcave at $x^*$ and either
Hadamard differentiable at $x^*$ or Lipschitz near $x^*$; $H_i$ ($i \notin I_{00}$) are continuous at
$x^*$; $h_i$ ($i \in I_h$), $H_i$ ($i \in I_{0+}$) are Fréchet differentiable at $x^*$, continuous in a
neighborhood of $x^*$, and quasiaffine at $x^*$; $\{Dh_i(x^*)\ (i \in I_h),\ DH_i(x^*)\ (i \in I_{0+})\}$
are linearly independent, and there exists $\hat{x} \in X$ such that

\[
\begin{aligned}
g_i(\hat{x}) &< 0,\ \forall i \in I_g, \\
G_i(\hat{x}) &< 0,\ \forall i \in I_{+0}, \\
H_i(\hat{x}) &> 0,\ \forall i \in I_{00},
\end{aligned}
\]

and

\[
\begin{aligned}
h_i(\hat{x}) &= 0,\ \forall i \in I_h, \\
H_i(\hat{x}) &= 0,\ \forall i \in I_{0+}.
\end{aligned}
\]


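Proposition 3 The CCQ-NMPVC holds at a local minimum $x^* \in \Omega$ if and only if
SCQ-NMPVC is satisfied at $x^*$.


Proof Suppose that SCQ-NMPVC holds at $x^*$. Since $g_i$ ($i \in I_g$), $G_i$ ($i \in I_{+0}$) are
M-P pseudoconvex at $x^*$, one has

\[
\begin{aligned}
g_i^{\diamond}(x^*; \hat{x} - x^*) &< 0,\ \forall i \in I_g, \\
G_i^{\diamond}(x^*; \hat{x} - x^*) &< 0,\ \forall i \in I_{+0}.
\end{aligned}
\]

By the M-P pseudoconcavity of $H_i$ ($i \in I_{00}$), one has

\[
H_i^{\diamond}(x^*; \hat{x} - x^*) > 0,\ \forall i \in I_{00}.
\]

Since $h_i$ ($i \in I_h$), $H_i$ ($i \in I_{0+}$) are quasiaffine at $x^*$, one has

\[
\begin{aligned}
\langle Dh_i(x^*), \hat{x} - x^* \rangle &= 0,\ \forall i \in I_h, \\
\langle DH_i(x^*), \hat{x} - x^* \rangle &= 0,\ \forall i \in I_{0+},
\end{aligned}
\]

which implies that the system in Definition 12 is solvable for $v := \hat{x} - x^* \in X$ and
hence the CCQ-NMPVC holds at $x^*$.

Conversely, suppose that CCQ-NMPVC holds at $x^*$. By the assumptions in the
proposition, there exists $v \in X$ such that

\[
\begin{aligned}
g_i^{\diamond}(x^*; v) &< 0,\ \forall i \in I_g, \\
G_i^{\diamond}(x^*; v) &< 0,\ \forall i \in I_{+0}, \\
H_i^{\diamond}(x^*; v) &> 0,\ \forall i \in I_{00},
\end{aligned}
\]

and

\[
\begin{aligned}
\langle Dh_i(x^*), v \rangle &= 0,\ \forall i \in I_h, \\
\langle DH_i(x^*), v \rangle &= 0,\ \forall i \in I_{0+}.
\end{aligned}
\]

Thus, there exists $\lambda > 0$ such that

\[
\begin{aligned}
g_i(x^* + \lambda v) &< 0,\ \forall i \in I_g, \\
G_i(x^* + \lambda v) &< 0,\ \forall i \in I_{+0}, \\
H_i(x^* + \lambda v) &> 0,\ \forall i \in I_{00},
\end{aligned}
\]

and

\[
\begin{aligned}
h_i(x^* + \lambda v) &= 0,\ \forall i \in I_h, \\
H_i(x^* + \lambda v) &= 0,\ \forall i \in I_{0+}.
\end{aligned}
\]

Hence, the system of Definition 13 has a solution $\hat{x} := x^* + \lambda v \in X$, that is,
SCQ-NMPVC is satisfied at $x^*$. □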


The following is a nonsmooth Mangasarian-Fromovitz constraint qualification
for the NMPVC, denoted by MFCQ-NMPVC, which will serve as a sufficient
condition for the CCQ-NMPVC to hold.


Definition 14 Let $x^* \in \Omega$ be a feasible solution for the NMPVC (1). The
MFCQ-NMPVC is satisfied at $x^*$, iff $g_i$ ($i \in I_g$), $G_i$ ($i \in I_{+0}$), $H_i$ ($i \in I_{00}$) are either
Hadamard differentiable at $x^*$ or Lipschitz near $x^*$; $g_i$ ($i \notin I_g$), $G_i$ ($i \notin I_{+0}$),
$H_i$ ($i \notin I_{00}$) are continuous at $x^*$; $h_i$ ($i \in I_h$), $H_i$ ($i \in I_{0+}$) are Fréchet differentiable at $x^*$
and continuous in a neighborhood of $x^*$; $\{Dh_i(x^*)\ (i \in I_h),\ DH_i(x^*)\ (i \in I_{0+})\}$ are
linearly independent, and there exists $\hat{v} \in X$ such that

\[
\begin{aligned}
g_i^{\diamond}(x^*; \hat{v}) &< 0,\ \forall i \in I_g, \\
G_i^{\diamond}(x^*; \hat{v}) &< 0,\ \forall i \in I_{+0}, \\
H_i^{\diamond}(x^*; \hat{v}) &> 0,\ \forall i \in I_{00},
\end{aligned}
\]

and

\[
\begin{aligned}
\langle Dh_i(x^*), \hat{v} \rangle &= 0,\ \forall i \in I_h, \\
\langle DH_i(x^*), \hat{v} \rangle &= 0,\ \forall i \in I_{0+}.
\end{aligned}
\]

Remark 1 For $X := \mathbb{R}^n$ and for the continuously differentiable case the above
constraint qualification reduces to VF-MFCQ studied in [2].


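The following result gives the relationship between the MFCQ-NMPVC and the
CCQ-NMPVC.


Proposition 4 If MFCQ-NMPVC holds at a local minimum $x^*$ of the NMPVC (1),
then CCQ-NMPVC is also satisfied at $x^*$.


Proof Suppose that CCQ-NMPVC does not hold at a local minimum $x^*$ of the
NMPVC (1). Then, by Definition 12, there does not exist $v \in X$ such that

\[
\begin{aligned}
g_i^{\diamond}(x^*; v) &< 0,\ \forall i \in I_g, \\
G_i^{\diamond}(x^*; v) &< 0,\ \forall i \in I_{+0}, \\
H_i^{\diamond}(x^*; v) &> 0,\ \forall i \in I_{00},
\end{aligned}
\]

and

\[
\begin{aligned}
\langle Dh_i(x^*), v \rangle &= 0,\ \forall i \in I_h, \\
\langle DH_i(x^*), v \rangle &= 0,\ \forall i \in I_{0+}.
\end{aligned}
\]

By Motzkin's theorem of the alternative (see, e.g., [31]), there exist $\alpha_i \in \mathbb{R}_+$ ($i \in I_g$),
$\eta_i^{G} \in \mathbb{R}_+$ ($i \in I_{+0}$), $\gamma_i \in \mathbb{R}_+$ ($i \in I_{00}$), not all zero, and $\beta_i \in \mathbb{R}$ ($i \in I_h$),
$\eta_i^{H} \in \mathbb{R}$ ($i \in I_{0+}$) such that

\[
\begin{aligned}
& \sum_{i \in I_g} \alpha_i g_i^{\diamond}(x^*; v) + \sum_{i \in I_{+0}} \eta_i^{G} G_i^{\diamond}(x^*; v) - \sum_{i \in I_{00}} \gamma_i H_i^{\diamond}(x^*; v) \\
& \quad + \sum_{i \in I_h} \beta_i \langle Dh_i(x^*), v \rangle - \sum_{i \in I_{0+}} \eta_i^{H} \langle DH_i(x^*), v \rangle \ge 0,\ \forall v \in X.
\end{aligned}
\tag{20}
\]

Taking $v = \hat{v}$ from Definition 14 in (20), the last two sums vanish, and it follows that

\[
\sum_{i \in I_g} \alpha_i g_i^{\diamond}(x^*; \hat{v}) + \sum_{i \in I_{+0}} \eta_i^{G} G_i^{\diamond}(x^*; \hat{v}) - \sum_{i \in I_{00}} \gamma_i H_i^{\diamond}(x^*; \hat{v}) \ge 0.
\]

By the strict inequalities in Definition 14, it follows that

\[
\alpha_i\ (i \in I_g) = \eta_i^{G}\ (i \in I_{+0}) = \gamma_i\ (i \in I_{00}) = 0.
\]

Hence, (20) implies that

\[
\sum_{i \in I_h} \beta_i \langle Dh_i(x^*), v \rangle - \sum_{i \in I_{0+}} \eta_i^{H} \langle DH_i(x^*), v \rangle = 0,\ \forall v \in X.
\]

By the linear independence in Definition 14, one has

\[
\beta_i\ (i \in I_h) = \eta_i^{H}\ (i \in I_{0+}) = 0,
\]

a contradiction to the existence of multipliers that are not all zero, and this completes the
proof. □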


The following is a nonsmooth linear independence constraint qualification for
the NMPVC, denoted by LICQ-NMPVC, which will serve as a sufficient condition
for MFCQ-NMPVC to hold.


Definition 15 Let $x^* \in \Omega$ be a feasible solution for the NMPVC (1). The
LICQ-NMPVC is satisfied at $x^*$, iff $g_i$ ($i \in I_g$), $G_i$ ($i \in I_{+0}$), $H_i$ ($i \in I_{00}$) are either
Hadamard differentiable at $x^*$ or Lipschitz near $x^*$; $g_i$ ($i \notin I_g$), $G_i$ ($i \notin I_{+0}$),
$H_i$ ($i \notin I_{00}$) are continuous at $x^*$; $h_i$ ($i \in I_h$), $H_i$ ($i \in I_{0+}$) are Fréchet differentiable at $x^*$
and continuous in a neighborhood of $x^*$; and the subgradients

\[
\begin{aligned}
& \xi_i^{g} \in \partial^{\diamond} g_i(x^*),\ \forall i \in I_g, \\
& \xi_i^{G} \in \partial^{\diamond} G_i(x^*),\ \forall i \in I_{+0}, \\
& \xi_i^{H} \in \partial^{\diamond} H_i(x^*),\ \forall i \in I_{00}, \\
& Dh_i(x^*),\ \forall i \in I_h, \\
& DH_i(x^*),\ \forall i \in I_{0+}
\end{aligned}
\]

are linearly independent.

Remark 2 For the continuously differentiable case and $X = \mathbb{R}^n$, the LICQ-NMPVC
reduces to the VC-LICQ introduced in [2] as a modification of the standard LICQ.
It is easy to see that LICQ-NMPVC implies MFCQ-NMPVC.


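In the following theorem we summarize the results obtained in this chapter.


Theorem 3 Let $x^* \in \Omega$ be a local minimum of the NMPVC (1) with $I_{00} \cup I_{0-} = \emptyset$,
and let $H_i$ ($i \in I_{0+}$), $G_i$ ($i \in I_{+0}$) be M-P regular at $x^*$. Consider the following
constraint qualifications:


(i) the ACQ-NMPVC holds at $x^*$;
(ii) the CCQ-NMPVC holds at $x^*$;
(iii) the SCQ-NMPVC holds at $x^*$;
(iv) the MFCQ-NMPVC holds at $x^*$;
(v) the LICQ-NMPVC holds at $x^*$.


If any one of these constraint qualifications holds and $f$ is either Fréchet
differentiable at $x^*$ or Lipschitz near $x^*$, then there exist
scalars $\alpha_i \in \mathbb{R}_+$ ($i \in I_g$), $\beta_i \in \mathbb{R}$ ($i \in I_h$), $\gamma_i \in \mathbb{R}_+$ ($i \in I_{00}$),
$\eta_i^{H} \in \mathbb{R}$ ($i \in I_{0+}$), $\eta_i^{G} \in \mathbb{R}_+$ ($i \in I_{+0}$) such that

\[
\begin{aligned}
0 \in\ & \partial^{\diamond} f(x^*) + \sum_{i \in I_g} \alpha_i \partial^{\diamond} g_i(x^*) + \sum_{i \in I_h} \beta_i \partial^{\diamond} h_i(x^*) \\
& - \sum_{i \in I_{00}} \gamma_i \partial^{\diamond} H_i(x^*) - \sum_{i \in I_{0+}} \eta_i^{H} \partial^{\diamond} H_i(x^*) + \sum_{i \in I_{+0}} \eta_i^{G} \partial^{\diamond} G_i(x^*).
\end{aligned}
\]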


5 Conclusions 


In this chapter, we have studied an optimization problem with equality, inequality,
and vanishing constraints under mixed assumptions of Fréchet differentiability,
Gâteaux differentiability, Hadamard differentiability, and Lipschitz continuity. We
called this problem a nonsmooth mathematical program with vanishing
constraints (NMPVC) and observed that, under some fairly mild assumptions, the
nonsmooth analogues of some standard constraint qualifications, like the
Mangasarian-Fromovitz constraint qualification and the linear independence
constraint qualification, do not hold for the NMPVC, just as in the continuously
differentiable case (see, e.g., [2]). We established various modifications of the Abadie constraint
qualification, the Cottle constraint qualification, the Slater constraint qualification,
the Mangasarian-Fromovitz constraint qualification, and the linear independence
constraint qualification taking into account the special structure of the NMPVC.
The relationships among various constraint qualifications have been established and
the nonsmooth KKT type necessary optimality condition for the NMPVC has been
derived in terms of the M-P subdifferential, as it is the smallest among all the convex
valued subdifferentials.